DOCUMENT RESUME

ED 135 801                                                  TM 005 718

AUTHOR          Cronbach, Lee J.; And Others
TITLE           Research on Classrooms and Schools: Formulation of
                Questions, Design and Analysis.
INSTITUTION     Stanford Univ., Calif. Stanford Evaluation
                Consortium.
SPONS AGENCY    Russell Sage Foundation, New York, N.Y.; Spencer
                Foundation, Chicago, Ill.
PUB DATE        Jul 76
NOTE            243p.
AVAILABLE FROM  Stanford Evaluation Consortium, School of Education,
                Stanford University, Stanford, California 94305
                ($1.00)
EDRS PRICE      MF-$0.83 HC-$12.71 Plus Postage.
DESCRIPTORS     Analysis of Covariance; Classroom Research;
                *Classrooms; *Educational Research; Mathematical
                Models; *Research Design; Research Methodology;
                *Research Problems; Sampling; *Schools; Social
                Science Research; Statistical Analysis
IDENTIFIERS     Aggregation Effects; *Aptitude Treatment
                Interaction



ABSTRACT

Alternative ways of analyzing data from Aptitude
Treatment Interactions were examined over a two-year period. In light
of past arguments the author maintains that the questions surrounding
aggregation have been badly posed and that the customary methods of
analysis were either incorrect or subject to misinterpretation.
Therefore, the majority of studies of educational effects (whether
classroom experiments, evaluations of programs, or surveys) have
collected and analyzed data in ways that conceal more than they
reveal. The established methods have generated false conclusions in
many studies. Further, the traditional research strategy which pits
substantive hypotheses against a null hypothesis and requires
statistical significance of effects can rarely be used in educational
research. Samples large enough to detect strong but probabilistic
effects are likely to be prohibitively costly. (Author/MV)



Documents acquired by ERIC include many informal unpublished
materials not available from other sources. ERIC makes every effort
to obtain the best copy available. Nevertheless, items of marginal
reproducibility are often encountered and this affects the quality
of the microfiche and hardcopy reproductions ERIC makes available
via the ERIC Document Reproduction Service (EDRS). EDRS is not
responsible for the quality of the original document. Reproductions
supplied by EDRS are the best that can be made from the original.






OCCASIONAL PAPERS OF THE STANFORD EVALUATION CONSORTIUM

Stanford University, Stanford, California 94305






RESEARCH ON CLASSROOMS AND SCHOOLS:
FORMULATION OF QUESTIONS, DESIGN, AND ANALYSIS

Lee J. Cronbach
with the assistance of
Joseph E. Deken and Noreen Webb



July, 1976 

The Stanford Evaluation Consortium is a group of faculty members and
students concerned with the improvement of evaluation of educational
and social-service programs. The Occasional Papers represent the views
of the authors as individuals. Comments and suggestions for revision
are invited. The papers should not be quoted or cited without the
specific permission of the author; they are automatically superseded
upon formal publication of the material.



Additional copies of this paper are available for $1.00 each from the
Stanford Evaluation Consortium, School of Education, Stanford
University, Stanford, California 94305.



Table of Contents

Preface

1. Introduction to the problem
     From a statistical issue to a substantive issue
     The Abt Follow Through report
     The need to disentangle effects
     Psychological bias and sociological bias
     A debate in educational psychology
     Debates within sociology
     Units of analysis? of treatment? of theory?
     Units within hierarchies
     Units in areal analysis

2. Units in various research contexts
     The problem as seen in research on ATI
     Sample size for regression analysis
     The Maier-Jacobs study
     Harvard Project Physics
     Three kinds of process
     Aggregation effects
     Ecological psychology
     Evaluative studies and school-effect studies
     Extrapolation in interpretation

3. A mathematical model
     Definition of components
     Interpretation of components
     Partitioning variance
     Choices made in forming the model
     Direction of decomposition
     Nonlinearity
     Effects of aggregating data
     Special case with linear assumption
     Meaning of [symbol illegible]
     Comparing β_b and β_w
     Implication for ATI research
     Aggregation effects with multiple discriminants

4. The reference population and its parameters
     Alternative models for statistical inference
     Collectives distinct, persons fixed
     Collectives nested within local populations
     The independence assumption
     Choice among models
     Weights that define parameters
     Illustrative statistics for populations of collectives
     Head Start
     School districts in California
     A problem of estimation

5. Illustrative ATI studies
     The Anderson study
     A weighting decision
     Regressions of ZACH on PRECOM
     Regressions of ZACH on ABIL
     Cooperative Reading data
     Plan of the studies
     The original analysis across projects
     Half-class as unit of analysis?
     The original analysis within projects
     Procedures in our reanalysis
     Results of our analysis
     Conclusions regarding units of analysis
     Head Start Planned Variation

6. Disattenuating regression slopes
     Within- and between-groups reliabilities
     Case I
     Case II
     Case III
     General remarks

7. Statistical inference
     Sampling error of a mean
     The between-groups regression
     The within-groups regression

8. Analysis of covariance
     Design 1. Collectives nested in treatments
     Alternative adjustments
     Cooperative Reading data
     Follow Through data
     Design 2. Treatments crossed with blocks; collectives nested
     The plan of the Head Start study
     Alternative analyses, assuming homogeneity of regression
     Alternative analyses, recognizing heterogeneity of regressions

9. Multivariate considerations
     Simple correlations
     Correlations of reading outcomes
     Component analysis and factor analysis
     Analysis of correlations
     Factoring covariances
     Slatin's analyses
     Suggestions
     Empirical test construction
     Multiple regression and related techniques
     A school-effects model
     Partialling

10. The Road Ahead



It will be evident from the physical form of the paper that it 
is a draft still undergoing revision. 

I have decided to distribute it in this form because of my
conviction that these issues are of vital importance and that it would be
counterproductive to delay the discussion until my argument is polished.
Millions of dollars are going into evaluation studies each year; it would
be a sufficient short-run contribution to persuade sponsors and
investigators to think hard about the questions raised here. I have not
resolved the problems presented by aggregate phenomena; it is my intention
here to stir up debate and to encourage proposals from others. I invite
(nay, beseech) comments and counterarguments from those who receive this
paper.

This project originated out of a concern with Aptitude x Treatment 
interactions. The procedure in ATI research is to calculate an outcome- 
on-aptitude regression for some teaching method, with the intent of 
discovering and explaining differences in regression slopes for alternative 
methods. During the same period I became involved in theoretical aspects 
of analysis of covariance. In that analysis, regression slopes have been 
regarded as instrumental rather than as of primary interest. As the Abt 
example in Section 1 shows, the issue of units of analysis arises there 
also. Thirdly, Leigh Burstein has completed a doctoral dissertation on 
the aggregation problem, as seen particularly in studying regressions 
calculated in educational sociology. I have served as chairman of his 
dissertation committee and find that experience influencing my thinking here. 

My two assistants, Joseph Deken and Noreen Webb, played an important
role in developing materials for this monograph. Miss Webb took primary
responsibility for the illustrative data analyses and the data-processing
methods, and Mr. Deken led the way in the statistical theorizing. Neither
of them is to be held responsible for the present content of the paper.
Analyses of California Assessment data were made by David Rogosa, under
support from the State, and Lynne Gray assisted in analyzing the Featherstone
data. David E. Wiley was good enough to work through the entire manuscript
and did much to correct and extend my thoughts. Conversations with
Leigh Burstein, Merrill Carlsmith, Dan Davis, Mike Hannan, and David Rogosa
were helpful; likewise, I thank Dudley Duncan, Robert Hauser, and Herbert
Walberg for suggestions in correspondence.

The revision of the report prepared for the Spencer Foundation and
its reproduction and distribution were supported by the Stanford Evaluation
Consortium under a grant from the Russell Sage Foundation.
1. Introduction to the Problem

From a statistical issue to a substantive issue

If a contour map of the region from Maine to Maryland were
prepared with a six-inch contour interval, a person inspecting the
map would not perceive the Atlantic Ocean. Some such remark
was the lead sentence of an article in Science some years back.
I have been unable to locate the article, and the contour mentioned
may have been six feet, not six inches. However the writer phrased
it, the point is that a fine-grain analysis can overlook large
configurations in the data. It is equally true that too gross an
analysis can conceal important relationships. Nor are all sins
those of omission; investigation on the wrong scale can positively
distort relations.

It is conventional in psychology and biology to regard the 
single organism as the object of investigation, and educational 
research workers took over that point of view. They frame hypotheses 
in terms of individuals and base their analysis on individual scores, 
though they do not always make the person the sampling unit. Only subsequent to 
the Coleman report of 1966 did the rise in sociological and economic 
research make hypotheses at the level of collectives common in education. 

The habits of the psychologist and biologist do not fit research
on classroom instruction. Rats receiving a drug or placebo are properly
considered to be independent subjects; what one rat does has no effect
on the score of the next (unless the experimenter somehow introduces
correlated errors). Students in a class, however, do not provide
independent evidence.

Typically, persons within a class are more alike at the outset of
instruction than persons randomly sampled from the relevant population.
Certain adventitious common experiences during instruction depress or raise
the scores of many of them: a flu epidemic, or perhaps a wave of enthusiasm.
What the class experiences goes beyond the treatment specified by the
experimenter. There are unintended treatment variations in the experiment
on rats also, but the design tries to ensure that no two rats experience
the same variation. In the classroom, where variation is common to members
of the class, the entire class provides a single observation on the effect
of the treatment. Some writings on educational statistics (e.g., Peckham
et al., 1969) warn against taking the number of students in the experiment
as the basis for evaluating degrees of freedom, on the grounds that this
gives an unjustifiably small estimate of the sampling error. The warning
they do not give is that analysis on individuals often looks at the wrong
question.

There is a literature in sociology that warns of the importance of
choosing the right unit, and most of that literature too has perceived the
question as one of analytic procedure.

Sociologists (and political scientists, economists, etc.)
work with censuses and other public records compiled for some aggregate unit
such as a county or an industry. If one wants to know how reliance on public
libraries relates to level of education, he may find circulation figures
available for each local library system, and educational statistics
available for census tracts. Then by combining census tracts that more or
less match the service area of the library system, he is enabled to
correlate book circulation with education. A famous paper by Robinson
(1950), echoing E. L. Thorndike (1939), warned against interpreting such
correlations as if they described relations at the individual level; over
the years, a rather sizable literature has reiterated or modified
Robinson's warning. I shall later review some of the current ideas, but I
do not concern myself with the problem of unavailability of data, which
motivated much of the original work.

The great majority of sociologists who deal with data at two levels have
carried out essentially the same analysis at two levels, or have mingled
measures on units and subunits in the same calculation. Thus, a sociologist
who had full data might enter into the same regression equation an
individual's use of the library, his education, and a measure of the size
of the community library, assigning that same value to all residents of the
community. The statistical analysis then pools all the individuals, without
regard to community boundaries.

I shall propose that in a large class of educational studies,
and probably in many other studies of social services, the more reasonable
analysis is to relate variables within groups (schools, communities), and
then to analyse group-level variables across groups. Whereas most
sociologists have related Y to X and to the group mean X̄, I propose to
relate Ȳ to X̄, and Y - Ȳ to X - X̄. This reformulation changes the Gestalt
of substantive findings.
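
The proposed decomposition can be sketched numerically. The following is a minimal illustration in Python, not a procedure taken from this report: the function name, the equal weighting of groups in the between-groups regression, and the toy scores are all assumptions made for the example. It computes the slope of group means on group means, the pooled slope of within-group deviations, and the overall slope that ignores group boundaries.

```python
import numpy as np

def between_within_slopes(x, y, group):
    """Return (between-groups, pooled within-groups, overall) slopes of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    _, idx = np.unique(np.asarray(group), return_inverse=True)
    counts = np.bincount(idx)
    x_bar = np.bincount(idx, weights=x) / counts  # group means of X
    y_bar = np.bincount(idx, weights=y) / counts  # group means of Y

    def slope(u, v):
        # Ordinary least-squares slope of v on u.
        u = u - u.mean()
        v = v - v.mean()
        return float((u * v).sum() / (u * u).sum())

    # Assumption: each group counts once, regardless of its size.
    b_between = slope(x_bar, y_bar)                   # Y-bar on X-bar
    b_within = slope(x - x_bar[idx], y - y_bar[idx])  # (Y - Y-bar) on (X - X-bar)
    b_total = slope(x, y)                             # individuals pooled, groups ignored
    return b_between, b_within, b_total

# Hypothetical scores for three classes of three pupils each.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [5, 4, 3, 11, 10, 9, 17, 16, 15]
group = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
print(between_within_slopes(x, y, group))
```

In these invented data the between-groups slope is positive and the within-groups slope is negative; the overall slope is a blend of the two and describes neither relation.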

Only recently have sociological writers pointedly recommended separate
examination of between-group and within-group statistics, though casual
references to or illustrations of such analyses appear here and there in
the literature. Alwin (1975) has now recommended this decomposition as a
superior way to examine composition or context effects, and Firebaugh's
(1975) theoretical paper on aggregation bias appears to be in close
harmony with this report, insofar as it overlaps. As far back as 1969,
Slatin analyzed relations of delinquency to other variables within areal
units of a community, and between areal units. In one of the most recent
studies of "school effects", Hauser, Sewell, and Alwin (1974) make use of
an overall regression analysis, a within-schools analysis, and a
between-schools analysis; but they use the analyses to examine somewhat
different aspects of the data. Minkowich, Davies, & Bashi (1976) have
analyzed the "little Coleman report" on Israeli schools by means of a
systematic separation of between-school and within-school relationships.

In writings on "the aggregation problem" or "the units-of-analysis
problem", the investigator is presumed to be interested in how one variable
depends on another (which may or may not have been manipulated). Writers
prior to Firebaugh have discussed whether analyzing means of classes or
other groups of subjects is an acceptable substitute for analyzing scores
individually, and vice versa.

In these discussions, the variables at the group level are "aggregates"
of measures on individuals. The choice of unit of analysis has a
considerable effect on correlations, regression slopes, and
within-treatment variances, though it has little or no effect on
unadjusted within-treatment means. The unit of analysis can make a
difference in the estimate of a covariate-adjusted treatment mean when
persons or classes have not been assigned to treatments at random or when
the number of independent assignments to treatment is small.

The Abt Follow Through report. The confused state of the art
and the importance of the problem are displayed in the analysis of
Follow Through Planned Variations data by Abt Associates (Cline, 1974).
The difficulties recognized there and the difficulties not recognized there
foreshadow most of my concerns in this report. In this multimillion-dollar
study, over a dozen sponsors set up FT groups, each group using whatever
model of compensatory education the sponsor advocated alongside an NFT
control. Control schools were only roughly comparable to the experimental
schools the sponsor used. Comparability of samples across and within
sponsors was so poor that Abt analyzed data of each sponsor as a separate
quasi-experiment. (I consider this sound; I have doubts about an overall
analysis in an appendix that considers sponsors simultaneously.) The fact
that the Abt consultants included methodologists prominent in educational
evaluation leads me to think that the analysis does reflect the state of
the art.

Three analyses were considered: individual, class, and school
(each within sponsor). The individual analysis started, in effect, by
punching one card with the data for each child. The sponsor's whole batch
of FT and NFT children was then run through an ancova program, to reach a
number described as the adjusted treatment effect. The school analysis
was the same, except that there was one card per school, with pupil
averages on variables replacing individual scores. The class analysis
was similar, with one card per class, all classes within a
treatment being pooled in the analysis.

Abt used different variable sets for the three analyses. Abt chose 18
covariates, but only 11 of these entered the individual analysis and 12
entered the school analysis. The global variable Southern/Western/Other
Region, for example, could have been punched in the cards for pupil and
class, if it was worth considering at the school level. The school-level
aggregate Percent Minority could have been represented at the individual
level by a Minority/Nonminority code. (But very likely Abt was not supplied
that datum.) Conversely, the individual variable Preschool Exp./No
Preschool Exp. could (and I think should) have reappeared as an aggregate.
(Some part of the differences in findings from the three analyses arose
because data were missing. E.g., children counted in the class
aggregate on certain variables were omitted from the individual analysis
because their scores were incomplete. Loss of data is a complication,
but probably not the main source of confusion.)

In my opinion Abt was correct to emphasize the school level in its
summaries. Treatments were assigned to schools, and no doubt program
delivery varied from school to school. Abt, however, feeling that the
rationale for choosing a level was weak, offered the three analyses as a
"cross-validation". It is not, of course, anything of the kind; the three
analyses are in no way independent, and they ask different questions. Abt
did indeed state that the analyses asked different questions, and that an
aggregate variable is a different variable from the disaggregated variable
that generated it. And yet, said Abt, if the three analyses give similar
estimates of the treatment effect, the result can be accepted with
"enhanced confidence". To be sure, if a critic is disappointed by the
finding that the school-level analysis reports, he may claim that analysis
at some other level would give the result he would like; presenting all
three analyses disarms such a critic. But it is a mistake to regard the
three analyses as equally relevant and equally legitimate.

The results of the three analyses did not agree. Sponsor 3, Arizona
(pp. VI-66ff.) provides a striking example. Let us confine attention to
effects on the Wide Range Achievement Test (WRAT). The first glaring
discrepancy appears in the unadjusted treatment effect. With the posttest
mean expressed in raw units, the differences (FT minus NFT) are

    Level    Mean diff.     N                  Pooled s.d.   t
    Pupil    +1.37          317 FT, 265 NFT    12.8          not given
    Class    3.07           38 FT, 26 NFT      7.2           not given
    School   (illegible)    20 FT, 21 NFT      .8            -.6

It is to be expected that s.d.'s will be larger at the individual level.
It might be expected that means and mean differences will be the same
except for such perturbations as missing cases introduce, but they are not.
(Discrepancies seem much larger when the Abt charts display each mean
difference divided by the corresponding s.d.!) My only guess as to the
reason for the discrepant means is that the one-card-per-class and
one-card-per-school techniques of calculation weighted cases differently.
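
The weighting conjecture is easy to illustrate. The sketch below is in Python and uses invented scores, not the Abt data; the function names are hypothetical. It compares a mean in which every pupil counts once with a mean in which every class counts once. When classes differ in size the two need not agree, and neither need the FT minus NFT difference.

```python
def mean_of_pupils(classes):
    """Mean with every pupil weighted equally (one card per pupil)."""
    total = sum(sum(scores) for scores in classes)
    n = sum(len(scores) for scores in classes)
    return total / n

def mean_of_class_means(classes):
    """Mean with every class weighted equally (one card per class)."""
    return sum(sum(s) / len(s) for s in classes) / len(classes)

# Hypothetical posttest scores; the FT classes are unequal in size.
ft_classes = [[10, 12, 14], [20]]
nft_classes = [[11, 13], [15, 17]]

pupil_diff = mean_of_pupils(ft_classes) - mean_of_pupils(nft_classes)
class_diff = mean_of_class_means(ft_classes) - mean_of_class_means(nft_classes)
print(pupil_diff, class_diff)
```

With these made-up numbers the pupil-weighted difference is zero while the class-weighted difference is two points, though the same children enter both calculations.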

I agree with the decision to calculate t at the school level only.
I do not agree with the decision to test the unadjusted difference, however.
Adjustment changes the picture. The mean differences become

    Level    Adjusted mean diff.    Change    t
    Pupil     0.36                  -1.01
    Class    -1.47                  -4.54
    School    0.24                  +0.39     1.37

The adjustment, then, reduced the effect in two analyses, as it should if
the FT sample was superior to the NFT sample at the outset. But it
increased the effect at the school level, which could only happen if the
NFT sample was better at the outset or the regression slope, positive
at the class level, changed to negative at the school level! At least
one of the adjusted analyses must be seriously wrong. In fact, it can be
argued that none of them is of much value. The pupil-level analysis and
probably the class-level analysis are theoretically inappropriate; and the
number of classes or of schools is too small to determine adequately the
regression coefficient on which the adjustment is based.
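
The logic of the sign argument can be checked with the textbook single-covariate adjustment. This is a simplification: Abt fitted many covariates, and the figures below are hypothetical. The adjusted difference is the raw posttest difference minus the slope times the pretest difference. The function name is an assumption for illustration.

```python
def adjusted_difference(y_ft, y_nft, x_ft, x_nft, slope):
    """Single-covariate ancova adjustment of a mean difference (FT minus NFT)."""
    return (y_ft - y_nft) - slope * (x_ft - x_nft)

# Hypothetical means: raw posttest difference is +1.0 in every case.
# FT ahead on the pretest, positive slope: adjustment lowers the effect.
print(adjusted_difference(11.0, 10.0, 6.0, 4.0, 0.5))
# FT ahead on the pretest, negative slope: adjustment raises the effect.
print(adjusted_difference(11.0, 10.0, 6.0, 4.0, -0.5))
# NFT ahead on the pretest, positive slope: adjustment also raises the effect.
print(adjusted_difference(11.0, 10.0, 4.0, 6.0, 0.5))
```

Only the second and third configurations raise the effect, which is why an increase at the school level implies either a reversed initial advantage or a slope of opposite sign.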

Abt wisely did not test significance at the two lower levels.
Many if not most investigators would have done so, if only because the
larger N promises a higher significance level.[2] The most serious question
to be raised about Abt's significance test is whether it is meaningful.
In a generalized regression analysis loosely comparable to analysis of
covariance, 12 regression coefficients plus two constants were fitted to the
41 school means on the covariates. Multiple-regression coefficients are
notoriously unstable in small samples; if the coefficients change, the
adjustment is likely to change dramatically when the groups are dissimilar
to begin with. It is advisable in general to distrust any one regression
coefficient when predictors are correlated, even when samples are large.
The treatment effect in this study is literally calculated as a thirteenth
regression coefficient. I suspect that in a quasi-experiment like this,
uncertainty regarding the adjusted treatment effect in the population is
much larger than the conventional significance test indicates.

The Abt group went one step beyond ancova. Theirs is one of the rare
analyses that takes seriously the many warnings in the statistical and
psychological literature about fallible covariates. The fallible covariate
most likely underadjusts, hence disattenuation is vital in a nonrandom
experiment. Abt does disattenuate the adjusted treatment effect in the
pupil-level analysis and so arrives at one final "true score adjusted
treatment effect". A value of 0.09 replaces the pupil-level "adjusted
effect" of 0.36 for WRAT with Sponsor 3. (In other instances, the
change is sometimes an increase and is sometimes a change of sign.)

How Abt disattenuated is a mystery. Abt correctly states that the only
sound correction method available in 1974 was limited to the study with a
single covariate. Yet the analysis they disattenuated was a multiple
regression with several fallible covariates. It seems likely that they used
one of the unacceptable techniques in circulation in early 1974.

Cronbach, Rogosa, Floden, and Price (1976), building on an unpublished paper
of Keesling and Wiley, have recently put forth a correction for the
multivariate case. Abt might, of course, have hit upon this method.

In any event, the point to be made here is that aggregate data again
spawn confusion. Abt corrected only the individual analysis, arguing
that class and school data are much more "stable" and in need of no
correction. Later we shall see, however, that group regressions may be just
as fallible as individual ones. The standard error of measurement of
a group mean (with pupils fixed) is small, but the coefficient of
generalizability for the group means (which enters the disattenuation
formula) may be lower than that for individual data. Class and school
analyses of covariance ought to be disattenuated when assignment
is not random.
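
A rough numerical sketch shows how a small standard error can coexist with a low coefficient. It assumes the classical ratio of true variance to true-plus-error variance, independent errors across pupils, and invented variance components; the formulas developed later in this report may differ in detail.

```python
def reliability(true_var, error_var):
    """Classical ratio: true-score variance over true plus error variance."""
    return true_var / (true_var + error_var)

# Hypothetical variance components (assumptions for illustration only).
error_var_individual = 25.0   # measurement error variance for one pupil
true_var_individual = 100.0   # true-score variance among pupils
true_var_group_means = 0.5    # true variance among class means
pupils_per_class = 25

# Error variance of a class mean when errors are independent across pupils.
error_var_group_mean = error_var_individual / pupils_per_class

sem_individual = error_var_individual ** 0.5
sem_group_mean = error_var_group_mean ** 0.5
rel_individual = reliability(true_var_individual, error_var_individual)
rel_group_mean = reliability(true_var_group_means, error_var_group_mean)
print(sem_individual, sem_group_mean, rel_individual, rel_group_mean)
```

Here the class mean has a standard error one-fifth that of an individual score, yet its coefficient is about one-third against 0.8 for individuals, because the classes differ so little in their true means.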

The need to disentangle effects. Only
chaotic debate can result from program evaluations in education until the
present confusion about units of analysis is dispelled. The issue is not
really one of inference from sample to population, as the infrequent
treatment of the issue in statistics texts suggests. And it is not
usually one of "substituting" analyses of aggregates for analyses of
individuals. Conflicting if not wholly incorrect descriptive results
in the Abt sample are the root source of confusion.

Analyses at the group level and the individual level give conflicting
descriptive results because they bear on different substantive questions.
The investigator who "wants to know the relation between two variables" is
not asking a clear question until he tells whether the group or individual
relation is the one of interest. The investigator who proposes to partial
out certain influences has to specify which relations he
intends to remove, and he had better know why! Some social scientists
have recognized that the problem is less one of choosing the right analysis
and more one of asking the right question (Dogan & Rokkan, 1969).

Scheuch's (1966) exposition of how the choice of unit depends upon the
theoretical question in hand, and of how the evolving theory takes shape
and power from the choice of units once it is made, is outstandingly
complete and eloquent. But Scheuch is concerned with the
choice of units, instead of with the problem of separating between-groups
from within-groups effects. Duncan, Featherman, and Duncan (1972) do
have a clear discussion of what is to be gained from such separation, an
argument faintly foreshadowed in the marvelously lucid pioneering work
of Duncan, Cuzzort, and Duncan (1961).

Insofar as relevant experiences are associated with groups,
there are two matters to consider: between-groups relations and
within-group relations. The overall individual analysis combines these,
to everyone's confusion.

A distinction between aggregate and global data is sometimes
made (but not in a consistent way). I shall define an aggregate datum
as a simple composite (count, average) of individual
characteristics such as per capita income, sex ratio within a school,
mean reading level, or percentage of dropouts. Global characteristics are those
associated with the collective that are not operationally divisible
over individuals, e.g., the per-pupil school budget, the age of a
school principal, the size of the school library, the fraction of
meetings of a class that are devoted to discussion. A count of a
characteristic on which individuals do not vary within groups (e.g.,
population in an areal unit; sex in sex-homogeneous intact groups) is
classed as a global property. The distinction is unimportant, since the
two kinds of variables are to be analyzed in exactly the same way. The
only real difference is that aggregate variables confuse interpreters,
who are inclined to regard the aggregated and disaggregated
data as alternative representations "of the same variable". Except in
pretest measures on newly assembled groups, they are not (see below).

The interplay between aggregate and individual phenomena can be
illustrated by considering the proportion of college-educated
in a community. An industry needing an educated labor force is
attracted to the area. Then the probability that a person will work in
this industry is not merely a function of his individual level of education;
it is a function of the educational level in the area where he resides.
Causality is equivocal, since the industry, once established,
attracts persons with suitable education into the area.

This example draws attention to a point insufficiently emphasized in
the literature on aggregation effects.

The aggregate variable often represents a different construct from the
individual-level variable. A particular relationship might happen to
have the same form and parameters at both levels, but even if both
relations were described by (say) Y = 2X + 3, the relations are rarely
"the same". The aggregate X and the individual X are different variables;
ditto for Y.[3] That the individual is college educated indicates a good deal
about what he would be inclined to purchase or what jobs he would be
capable of holding. The aggregate college education in the community
not only describes an aggregate market and an aggregate employee pool;
it says a good deal about what goods and services probably are well
supplied in the community (pediatricians? art movies? books? brokerage
offices? etc.), and a good deal about the kinds of jobs offered. The
aggregate construct enters into a network of relations describing properties
of groups (global as well as aggregate properties). It is true that a
college graduate is more likely to live in a community where the proportion
of college graduates is high. But inference from his individual education
to the probability that a choice of pediatricians is available to him is a
weak inference, mediated first of all by the characteristics of the group.
His individual probability of knowing of multiple pediatricians, when they
are in the community, does depend on his own education. Instead of
considering group and individual relations as alternative versions of the
same information, I propose to regard them as statements about different
variables, even when the variables originate in the same operation.[3a]

In educational research, practical considerations sometimes suggest
that one level is more relevant than the others. The State of California,
for example, conducts a testing program whose main function is to inform
local district boards how adequate the achievement of pupils in their
school system is. The district mean in reading is presented
alongside a regression estimate of the expected reading mean. In
1971-73 (for example), the variables given
greatest weight in predicting Grade-6 reading in unified school districts
were an index of family poverty, per cent college educated, and per cent
Spanish surnamed. These variables were all aggregated to the district
level, and districts were taken as the unit of analysis. This is logical.

The State also reports scores school by school, and compares the school 

score to a regression estimate of each expected school mean. 
There is no a priori reason for the raw-score regression weights for districts
to give just predictions at the school level. The State might
form a school-level regression equation, entering
all the individual schools into the calculation. But this is
less logical than a two-step operation that predicts the district mean
and the school's deviation from the district mean. The two-step procedure permits assigning
one weight to per-cent-college-educated at the district level, and another
weight to the school percentage expressed as a deviation from the district
percentage (and perhaps a third to the product of the two).
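
The two-step operation described above can be sketched numerically. What follows is only an illustration under stated assumptions: the data are synthetic, the names (pct_college, reading, and so on) are hypothetical, and the generating weights of 0.9 between districts and 0.3 within districts are invented for the example. It is not the State's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layout: 40 districts with 8 schools each.
n_districts, n_schools = 40, 8
district = np.repeat(np.arange(n_districts), n_schools)

# Per cent college educated: a district component plus a school component.
pct_college = rng.normal(30, 8, n_districts)[district] + rng.normal(0, 5, district.size)
district_mean_x = np.bincount(district, weights=pct_college) / n_schools
school_dev_x = pct_college - district_mean_x[district]

# Invented generating model: weight 0.9 between districts, 0.3 within.
reading = (50 + 0.9 * (district_mean_x[district] - 30)
           + 0.3 * school_dev_x + rng.normal(0, 2, district.size))

# Step 1: predict the district mean from the district-level predictor.
district_mean_y = np.bincount(district, weights=reading) / n_schools
b_between = np.polyfit(district_mean_x, district_mean_y, 1)[0]

# Step 2: predict the school's deviation from its district mean.
school_dev_y = reading - district_mean_y[district]
b_within = (school_dev_x @ school_dev_y) / (school_dev_x @ school_dev_x)

# For contrast: one regression over all schools, ignoring districts.
b_pooled = np.polyfit(pct_college, reading, 1)[0]

print(f"between={b_between:.2f} within={b_within:.2f} pooled={b_pooled:.2f}")
```

In this sketch the single school-level equation yields one weight lying between the two step-wise weights, so it describes neither level correctly when the two differ.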

"Choose the one unit that fits the decision" is an inadequate rule. 
In a seminar discussion of this report one person suggested that when policy 
makers want information at (say) the school level, this immediately settles 
the question of units of analysis. I
do not think so; analysis with "school as unit" is not the same as analysis
of districts and schools within district. But, in a hierarchical analysis, 
the results at two or more levels can be packaged into a statement that 
addresses the question in the decision-maker's mind. 

Another example comes from evaluation research. Suppose that an
educational innovation will be
installed, if at all, on a school-wide basis. To decide
for or against it one may need to know how student ability influences 
outcomes. The question can be posed in 

terms of individual or school characteristics (e.g., the mean ability score 
of the student body, the range of ability scores). The administrator's 
question appears to be, In the presence of what school characteristics 
does this innovation provide cost-effective results? Only if there is 
a live possibility of reassigning students among schools, or of assigning the
students within the school to different treatments, does the decision about
[illegible] individual differences.

This paper examines how to analyze so as to disentangle
effects at two (or more) levels and how to interpret both sets (or all sets)
of findings.

Psychological bias and sociological bias 

As Matilda Riley (1963, pp. 707ff.) said, it is natural for
psychologists to think in terms of individuals and for sociologists
to think in terms of collectives. Not only is the psychologist's
theory in that form, but the experimental tradition has
always looked on the single animal or the single human subject as a
biological organism responding to an objective, manipulable world. Experimental
research, even in social psychology, has consistently formulated propositions about
a condition that can be imposed "uniformly" on all subjects, as if they
were being run one by one in an experimental cubicle. This language has
been carried over into evaluation studies and research on classroom learning.

In psychology, units of analysis have received appreciable attention 
only in connection with laboratory studies of learning. A number of papers 
(e.g., Estes, 1956) have discussed the fact that "group learning curves"
(i.e., curves fitted to group averages on successive trials) have little
in common with individual learning curves. In particular, a group curve
showing gradual learning may actually be a composite of individual
curves, in each of which "sudden" learning occurs. Insofar as this discussion

has been influential, it has reinforced the psychologist's wish to avoid 
aggregation. 
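
The point about group learning curves can be illustrated with a small simulation. This is a sketch on invented data, not a reanalysis of any study cited here: each hypothetical subject is assumed to jump from no mastery to full mastery on a single, randomly placed trial.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_trials = 200, 30

# Each hypothetical subject learns "suddenly" at one trial (1..29).
jump_trial = rng.integers(1, n_trials, size=n_subjects)
trials = np.arange(n_trials)

# Individual curves are step functions: 0 before the jump, 1 from it onward.
individual = (trials[None, :] >= jump_trial[:, None]).astype(float)

# The group curve is the average over subjects at each trial.
group_curve = individual.mean(axis=0)

# Largest single-trial gain for each individual, and for the group average.
largest_individual_step = np.diff(individual, axis=1).max(axis=1)
largest_group_step = np.diff(group_curve).max()

print("every individual gain is all-or-none:", bool(np.all(largest_individual_step == 1.0)))
print("largest gain in the group curve:", round(float(largest_group_step), 3))
```

Every individual curve contains one all-or-none step, yet the averaged curve rises in small increments, which is the sense in which the aggregate curve misdescribes the individual process.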

Scheuch (1966) discussed similar individualist and collectivist 
biases as they have appeared in economics (and, incidentally, in 
political science). The attempt to develop theory by combining 
individual preference or demand functions appears to be the exact 
counterpart of the psychologist's attempt to combine individual 
learning curves, save that combining works out badly for the
psychologist and analysis at the individual level works out

A debate in educational psychology. The conflict between this
orientation to the individual and an orientation to the group 
seems first to have been aired 

in an educational context in 1967 (Wittrock & Wiley, 1970, pp. 271 ff.). 

In that symposium on evaluation David Wiley stated that the appropriate unit 

of study in educational evaluation is the collective — class or school — 

rather than the individual. (Today, he would not emphasize one unit to the
exclusion of the other.) Wiley was challenged by Benjamin Bloom, who
insisted that it is pupils the school teaches. Pupils react as individuals,
and the effects on them should be the focus. The instructor and psychologist,
Bloom protested, are too often pressed to investigate the wrong question just
because it fits into a rationale the methodologists find comfortable. 
Wiley properly retorted that he had been speaking as a substantive specialist 
on education, not as a statistician. Upon saying that, he was attacked 
by Robert Glaser for "ignoring the existence of a discipline called the 
experimental psychology of learning". 

Glaser judged it inappropriate to 
seek conclusions about classrooms. Effects in the classroom are an 
aggregation of effects of environmental arrangements on individuals. 
With a sufficient understanding of the laws of individual learning 
as compiled in experimental psychology, one would be 

ready to design environments. A bit later Glaser said, in echo of Bloom:
"It is still true that no one has ever taught a class. You teach an 
individual in the context of a class, but no one has ever taught a class. 
It is impossible to teach a class. You teach individuals whose behavior 
changes.... The class is a convenient artifact so that the teacher can 

reach one student."

Against this we can place one of Wiley's final remarks,
pregnant for this report: "When we talk about the effects of a treatment
on the classroom, we are talking about something fundamentally different 
from the effects of the treatment on the individuals in the classroom." 

Glaser's position does not appear to be tenable. In
principle, an adequate account of the laws of learning at the individual 
level would indeed predict response to any environment, just as in 
principle an adequate understanding of physical forces at the molecular 
level would account for the durability of a bridge. The laws that 
describe learning, however, have to be interactive laws that take into 
account both the characteristics of the individual and of the setting 
(Cronbach, 1975). Many of those interactions (e.g., effects on the
student of the abilities of the
other group members) can only be studied in the group context. That is
to say, parameters describing the group have to be written into the 

"laws of learning." Such relations can only be detected through 
research on groups of particular kinds (Putnam, 1973). 

Debates within sociology. Just as the psychologist prefers to see
individual causation wherever he looks, many a sociologist envisions group- 
level causal processes wherever he can. Aggregate variables have been of 
particular interest to those sociologists investigating social-psychological 
processes. The investigators at the Bureau of Applied Social Research at 
Columbia, and their disciples, have pursued studies of "context effects" 
with considerable enthusiasm. The central idea is that one's actions and 
decisions depend not only on his individual characteristics but also on 
those in his reference group.

Among reports of context effects or alleged context effects, the one
best known to educators is that of Coleman et al. (1966). It was argued
there, on the basis of a regression analysis, that a student's achievement
and aspirations increase if he is in a student body that is strongly motivated.

Allan Barton (1968) attacked those sociologists who processed data at
the individual level, as a prelude to a description of some of the causal
models that could be used at the group level:

For the last thirty years, empirical social research has been
dominated by the sample survey. But as usually practiced,
using random sampling of individuals, the survey is a
sociological meatgrinder, tearing the individual from his social
context and guaranteeing that nobody in the study interacts
with anyone else in it. It is a little like a biologist putting his
experimental animals through a hamburger machine and
looking at every hundredth cell through a microscope;
anatomy and physiology get lost, structure and function 
disappear, and one is left with cell biology. 

Barton went on to point out that to reduce sampling error the pollster
scatters his interviews widely and thereby loses the opportunity to look
at behavior in, for example, neighborhood clusters. Representative of
reports of context effects is a study by Bowers (1968) in 99 colleges.
Students were asked, for example, if they disapproved of drunkenness and if
they had been drunk. The percentage of drunkenness (or cheating, etc.) was
crosstabulated (see Barton, 1970) against individual approval/disapproval, within
colleges where (for example) the disapproval rate is high. The persons who as
individuals approved were less likely to have gotten drunk if the majority
in their college strongly disapproved. Hauser (1970b) pointed out that
Bowers was in effect entering the group mean X̄ and the individual attitude
score X into a multiple-regression equation to predict behavior, and then claiming the
positive weight for X̄ as evidence for a context effect.

Robert Hauser has spearheaded an opposition group within sociology.
His 1971 monograph reviewed the literature to that date and challenged those
who had tried to show context effects:

Contextual analysis is based on a misunderstanding of
statistical aggregation and of social process which is
rooted in the identification of differences among groups
with the social, and differences among individuals with
the psychological. [p. 13]

Bowers' two-variable analysis is of just that character (Hauser, 1970b).
Hauser goes on (1971, p. 46) to argue that the usual interpretation given
the Coleman report is indefensible. Those who are conservative regarding
causal interpretations typically refer to "compositional effects", a term
apparently introduced by Davis, Spaeth, and Huson (1961).

In an oft-cited paper (1970a), Hauser challenged his fellow sociologists 
just as Wiley challenged the psychologists. Hauser contrived a demonstration 
of a context effect: that educational aspiration of students (within either
sex) rises as the proportion of males in the high-school student body rises.
For the sake of heightening the drama, Hauser went on to propose social
policies that would hold down the proportion of females receiving a high- 
school education. Then he demolished the claim for a context effect by 
reinterpreting the global sex-ratio variable as a proxy for such aggregate 
variables as IQ and social status. The groups with high ratios also were 
higher in the proportion of high IQs and students of high status. 

Hauser's argument is essentially about specification error. If one
relates the dependent variable to only a fraction of the initial variables at
the individual level that contributed directly to the effect (or that contributed
to the allocation of persons into groups), this is equivalent to using an inadequate
covariate to adjust scores in a quasiexperiment. Only if an ideal adjustment is
made (Cronbach et al., 1976) will one properly evaluate the effect of groups as such.

Barton (1970) challenged Hauser's argument and Hauser (1970b) replied.
The debate continued in a paper by Farkas (1974) and a rejoinder by Hauser
(1974). The several papers cite earlier arguments for and against contextual
interpretations. It is unnecessary to restate the several positions, particularly
since I am advocating a kind of analysis not discussed directly by
the others. It may be useful to restate the essence of Hauser's position as
I understand it. The heart of the matter is a rule of parsimony: if most of
the variance can be explained by individual-level relationships, there is no
need to invoke a contextual explanation. Thus, where Bowers gave X and X̄
equal status in his regression, Hauser considers it appropriate to calculate
regression weights for X and X̄·X (the part of X̄ not predictable from X).
Since X and X̄ are correlated, this procedure
allocates most of the predictable variance to the first predictor. (My
proposed scheme is similar to Hauser's save that it fits weights to X̄ and
X·X̄, which equals X - X̄.) Hauser does not deny the possibility of causal
effects at the group level, but he places on them the burden of proof. More- 
over, and his point is one that no writer of the 1970's would deny, any 
serious claim to a group-level causal effect ought to be supported by tracing it 
to observable intermediate processes. Simple pre-post correlations or 
regressions do not carry much weight in a discussion of causes today. 
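
The relation between the two parameterizations can be checked numerically. The sketch below uses synthetic data and hypothetical names; the generating weights (0.5 for X, 0.4 for the group mean X̄) are invented, and the ols helper is written for this example only. It compares a regression that enters X and X̄ side by side with one that enters X̄ and the deviation X - X̄.

```python
import numpy as np

rng = np.random.default_rng(2)
n_groups, n_per = 50, 20
group = np.repeat(np.arange(n_groups), n_per)

# Individual score X with a group component; X-bar is the group mean of X.
x = rng.normal(0, 1, n_groups)[group] + rng.normal(0, 1, group.size)
x_bar = (np.bincount(group, weights=x) / n_per)[group]

# Invented outcome: weight 0.5 on X and an added weight 0.4 on X-bar.
y = 0.5 * x + 0.4 * x_bar + rng.normal(0, 1, group.size)

def ols(y, *predictors):
    """Least-squares fit with an intercept; returns (coefficients, fitted values)."""
    design = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef

# X and X-bar entered with equal status.
(_, b_x, b_xbar), fit_equal_status = ols(y, x, x_bar)

# X-bar and the within-group deviation X - X-bar.
(_, w_between, w_within), fit_decomposed = ols(y, x_bar, x - x_bar)

print(f"b_x={b_x:.3f}  b_xbar={b_xbar:.3f}")
print(f"w_between={w_between:.3f}  w_within={w_within:.3f}")
```

The two equations give identical predictions. The within-group weight equals the weight for X, and the between-group weight equals the sum of the weights for X and X̄, so the schemes differ in how the predictable variance is described, not in how well the data are fitted.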

The terminology of the sociological debate has been an unnecessary source 

of confusion. I suggest that three kinds of relations are worth distinguishing: 

1. Demographic effects. The groups examined have, as groups,
no causal influence. But the groups differ on certain precursors
of the outcome variable of interest. Processes at the individual
level would generate outcome differences between the groups. This is
Hauser's preferred explanation for observed effects at the group
level. One might speak of "composition" effects, but there are
ambiguities in the term. If desegregated schools create outcomes
unlike those the same students would have had in segregated
schools, this is a consequence of student
body "composition". While "demographic" is open to the same
construction, I think it can serve as I have defined it.

2. Group-caused effects. Outcomes for a given individual 
depend on the group he associates with or the setting in which 
his group works. This includes "context" effects that arise 
from peer influence, and also "school" effects that arise from 
particular curricular offerings or other nonpsychological causes. Insofar 
as the events in the desegregated school modify outcomes, the 

effect is "group-caused". To be sure, a new curriculum is not 
caused by the group, but it is a cause that affected the person 
because he is a member of the particular group. 

3. Arbitrary aggregation effects. The relations listed above apply 
to the study where groups are observed over a period of time and
changes are to be explained. Grouping is sometimes imposed on a 
body of data after the effects have been produced. This happens when 
survey data on, for example, race and unemployment are aggregated to 
the level of, say, the county. Insofar as the basis for aggregation 
correlates with either or both the variables of interest, statistics 
at the aggregate level may differ from the corresponding disaggregated 
statistics. 

As we proceed it will become increasingly evident that, from data on X 
and Y alone, it is impossible to establish which of these classifications a 
phenomenon falls into. 

Units of analysis? of treatment? of theory?

Abt, handed data to process, saw the question as one 
of units "of analysis". At an earlier stage in the Follow-Through evalu- 
ation, however, the question had been faced as a choice of units of design, i.e., 
of sampling and of treatment. The sponsor was instructed to identify schools in 
which he would install his FT treatment and similar schools to be NFT 
controls. This only crudely approximated a process of formal sampling 
and random assignment, but it did identify the school as the unit to which 
the treatment would be administered. (The plan actually called for treating just 
a few classes per experimental school, ignoring the others.)

A structurally different decision was made in designing the Performance
Contracting experiment (Ray, 1972). Districts were chosen as before, somewhat arbitrarily,
and two schools with disadvantaged pupils were selected within the district.
One of these went into each treatment. The district was a sampling unit,
given an intent to generalize the results into national policy. The school,
however, was the unit of assignment, hence of treatment.

Someone might challenge this terminology by describing a study with the 

same design where the treatment was a vaccine administered to each 

experimental student individually, with a placebo administered in the 

control school. Individual injections or no, I still 

see the treatment unit as the school. The design equalized district 

factors over treatments, but it confounded school factors with the treatment. 

If this design was consciously preferred to a split-school design, the 

justification must have been interest in some social effect (e.g., spread 

of the disease in an inoculated community).

It is possible for the unit of sampling and the unit of treatment to differ
in other ways. One might sample individuals and assign them to classes individually,
and then assign classes to treatment. Then the unit of treatment is clearly
the class. Conversely, one might sample classes and then assign individuals from
these classes to one or another independent treatment.

This brings us to the unit "of theory". The

choice of design is often constrained by practical matters, but 
the rationale for the design ought to come from theory. Theory 

need not be grand and abstract, but it does state a question in 
general terms. The wrong design may examine too broad or too narrow 
a phenomenon. Federal support for Performance Contracts was entertained as 
a national policy, but it was surely anticipated that each district 
would decide whether to enter such contracts. Hence a contrast of 
experimental and control districts would have been sensible. If
the thought was that PC, once adopted, would become mandatory for
all districts in the nation, the logical experiment would, on its
face, be a period of nationwide trial. The contrast group could be
another nation or the same nation in the pre-experiment
period. The notion of taking the nation as the unit of treatment may be
dismissed if theory says that every effect is mediated locally. In
some contexts no such claim would be made. America's "noble experiment",
the Eighteenth (Prohibition) Amendment, could not have been evaluated
by studies of prohibition at local option.

To define a unit of theory is to argue that there are boundaries in the
social space which mark off entities that have properties of their own. Just
how to identify "entities" or "systems" for scientific study, where object
boundaries are not apparent to the eye, is a question of long standing in many
fields including sociology (D.T. Campbell, 1958). Some social entities appear
to be good subjects around which to build theory; they cohere, and their members
undergo common experiences. Other groupings (e.g., by first letter of one's
name) have no more than momentary power to produce a common effect on the group
members. Groups that are real for some purposes (e.g., college majors) are
unlikely to be the groups around which some other aspect of behavior (e.g.,
social life) is organized. Groups that are interconnected in some respects,
as part of a larger system, may function as independent systems with
respect to certain phenomena.

Analysis at the level of the collective is likely to have no justifi- 
cation in science or in policy studies unless the collective is in some 
real sense a carrier of an effect. Shively (1969, p. 1184; his italics) 
warned against calculating ecological correlations, and presumably would 
warn against regressions also, "unless the theory with which we are working 
conceives of the aggregations we are using as real entities, for which no 
other type of aggregation can readily be substituted." In educational
research it does seem reasonable to think of classrooms and schools and 
districts as having real enough effects. To analyze at the group level 
seems to invite no greater penalty than the disappointment of looking for 
a group-level effect and finding it absent. In other kinds of research, 
the social fabric may be so seamless that no unit of theory can be readily 
defended. Then some model other than that of members-nested-in-units 
may be required. 

Hannan (1971) considers that the so-called aggregation problem in 
sociology (and economics) arises as much from the units of theory as
from the units of aggregation for analysis. (Where, as is usual, sociological
and economic data are collected naturalistically, no question of unit of
assignment arises, and Hannan does not concern himself directly with
sampling units.)
Macrosociology and macroeconomics seek generalizations applicable to
large collectives, whereas
microtheory seeks to generalize about processes occurring among small
units (e.g., social participation or purchasing behavior of the single
family). Propositions at the two levels may be cast in terms of the same
construct (e.g., per capita income). Some sociologists (Hannan points
to Parsons as arch-example) expect "homology", with the same relations
emerging at all levels once the right set of variables is identified.
Others, including Hannan, expect the micro path coefficients linking
homologous variables to differ from the coefficients generated by macrodata
on the same sample. He sees the ultimate problem not as picking a unit
of theory but as developing a "between levels" theory of aggregation
processes, to permit
reductive interpretation of macro data and aggregative interpretation of
micro data. Micro, macro, and aggregation relations together constitute
an ideal theory for Hannan.

What social scientists have generally seen as a problem of data analysis
has a striking correspondence to a major issue in the philosophy of natural
science, reductionism. Daniel Bell (1975) discusses the attitudes that
physicists, in particular, have taken to the proposition that relations need
to be developed in an integrated manner so that one can read upward from
subnuclear processes and downward from phenomena on the human scale and
larger. How close Bell is to our concerns is indicated by the fact that
at the outset he quotes J. S. Mill to illustrate "the 'naive'
formulation of the issue":

Human beings in society have no properties other than 
those which are derived from and may be resolved into the 
laws of the nature of individual man. 






The need to move back and forth between levels of aggregation 
is minimal for the physicist, Bell suggests, because the energy that 
goes into processes within atoms — for example — is 
some orders of magnitude below the energies that go into the kinetic 
motion of gas molecules at normal temperatures. The proposal to look
at each level in turn in social research cannot use that rationale; 
the energy in individual transactions must be of the same order of 
magnitude as most contextual effects arising 

from the class. Apart from that, my proposal to examine class-level 
relations and then to examine individuals-within-class is in striking
parallel to what the physicist does. Having studied molar gas laws 
to his heart's content, he turns to the study of forces binding atoms 
within the molecule. But he seeks conclusions about atoms within a 
pure gas, not a conclusion about atoms without regard to molecular 
context. 

I can see the benefit to be gained, in principle, from
Hannan's integrated theory. Suppose our question is, What will it do
to students' life chances if we require passage of an achievement test 
before allowing them a high-school diploma? A local ruling, a state 
ruling, or a national ruling would have different effects, and a theory 
of the type Hannan has in mind could hypothetically forecast them, 
without experimenting at all three levels in turn. I have no faith 
that social scientists can attain such powerful theory (Cronbach, 1975). 
If I am right, it is necessary to direct one's inquiry to whatever level is
most pertinent to the question of theory or practice of most immediate
concern.

Units within hierarchies. Almost all previous writers have

spoken of the contrast between analysis of elements and analyses of 

collectives, Abt and Hannan being recent examples. My plan of attack 

is instead to pick one level of collective and examine (a) relations 

between collectives at that level and (b) relations of elements within 

collectives (rather than relations of elements without regard to the 

6 

boundaries of their collectives). 

Almost all my argument will be confined to two stages — e.g., pupils 
within classes. I shall consistently treat a measure on the smaller unit 
as a composite of a mean for the larger unit and a deviation from that mean. 
(E.g., mean age of class, and pupil age minus class mean.)^ 
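
The decomposition of a pupil-level measure into a class mean and a deviation from that mean can be shown in a few lines. This is a sketch with invented ages for hypothetical classes; the only point is that the two components are uncorrelated by construction and that their variances add up to the total.

```python
import numpy as np

rng = np.random.default_rng(3)
n_classes, n_pupils = 12, 25
klass = np.repeat(np.arange(n_classes), n_pupils)

# Invented pupil ages: a class-level component plus a pupil-level component.
age = rng.normal(10.0, 0.5, n_classes)[klass] + rng.normal(0, 0.3, klass.size)

# Component 1: the class mean, repeated for every pupil in the class.
class_mean = (np.bincount(klass, weights=age) / n_pupils)[klass]

# Component 2: pupil age minus class mean.
deviation = age - class_mean

r_components = np.corrcoef(class_mean, deviation)[0, 1]
var_total = age.var()
var_between = class_mean.var()
var_within = deviation.var()

print(f"correlation of components: {r_components:.2e}")
print(f"total={var_total:.4f}  between+within={var_between + var_within:.4f}")
```

Because the deviations sum to zero within every class, the class-mean component and the deviation component carry non-overlapping information, which is what allows a separate weight to be fitted to each.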

There is no logical difficulty in extending such a series of components
over pupils-within-classes-within-schools-within-districts. With a dependent
variable at one level, all components of independent variables associated
with that level and higher levels may enter the analysis. (Also, a statistical
index derived from a component at a lower level may become an independent
variable. E.g., the s.d. on a predictor of pupils within a class may be used to
account for class mean differences on the dependent variable.)
The same
principle could be applied in the reverse manner, the class mean being a
composite of pupil score and class-mean-minus-pupil-score. Then the
dependent variable at one level is explained by components at that level
and lower levels. For some studies this may be more appropriate than
downward decomposition. The chief difficulty is that the deviation
score is correlated with the lower-level score, which complicates
analysis and interpretation. (The score X̄·X (= X̄ - η²X) has a zero
correlation with X.)
This upward decomposition was central to Blau's (1957) definition of structural
(context) effects. When Hauser restated Bowers' question in terms of
regression weights for X and X̄·X, he was proposing a similar upward decomposition.
This formulation seems to be the one that springs to the mind of
those sociologists who choose not to express relations directly in terms of
X and X̄. Perhaps this follows from the obvious causal principle that X̄
arises from X and from the sense that a context effect is something added.
A reference-group perception, however, might easily operate causally in
terms of X - X̄, as is seen in Meyer's hypothesis (1970, p. 63) that a
student's judgment of his own ability (which affects his aspirations)
arises from his standing relative to his group. I do not argue that X̄ is
prior to X; rather (p. 3.3), I make X prior to X̄ and X - X̄. I also partition
Y, which has not often been done in the sociological literature. Insofar as
I have a causal preconception, it is that X̄ often determines what educational
activities are offered to a class or student body. But no one causal position
fits all studies.
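
The difficulty with upward decomposition can be checked numerically. In the sketch below (synthetic data, hypothetical names) X̄ is the group mean of X and η² is taken to be the ratio of the variance of X̄ to the variance of X. The simple deviation X̄ - X is compared with the partialled score X̄ - η²X.

```python
import numpy as np

rng = np.random.default_rng(4)
n_groups, n_per = 30, 15
group = np.repeat(np.arange(n_groups), n_per)

# Invented individual scores with a group component.
x = rng.normal(0, 1, n_groups)[group] + rng.normal(0, 1.5, group.size)
x_bar = (np.bincount(group, weights=x) / n_per)[group]

# Correlation ratio: between-group variance as a share of total variance.
# The covariance of X-bar with X equals the variance of X-bar, so the slope
# of X-bar on X is eta squared.
eta_sq = x_bar.var() / x.var()

simple_dev = x_bar - x             # class-mean-minus-pupil-score
partialled = x_bar - eta_sq * x    # X-bar with X partialled out

r_simple = np.corrcoef(simple_dev, x)[0, 1]
r_partialled = np.corrcoef(partialled, x)[0, 1]

print(f"eta squared = {eta_sq:.3f}")
print(f"corr(X-bar - X, X) = {r_simple:.3f}")
print(f"corr(X-bar - eta^2 X, X) = {r_partialled:.2e}")
```

The simple deviation is strongly and negatively correlated with X (its correlation is the negative square root of 1 - η²), whereas the partialled score is uncorrelated with X.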

Units in areal analysis. The procedure probably does not apply well
to all the kinds of nesting considered in writings on aggregation. Where 
exposure to treatments takes place in South Sea islands, and the islands 
are assembled into collectives each of which unites the islands under its 
own policy, hierarchical analysis applies. This is equally the case with 
classrooms nested within schools. In the old problem of slicing up a time 
series, however, months nested within years or biennia are not islands. A 
price movement is not contained within a month or a year. Areal units 
similarly flow into one another and, at least in agricultural research, 
the reporting area corresponds badly to the causal variables such as
weather and marketing facilities. 

A model of units nested within larger units may be unrealistically simple 
even in schools. In simpler days, pupils were nested within classes, firms 
within industries, families within communities. Today, even the 9-year-old may 
work in a dozen groups and individual settings with several teachers and 
aides, all in the course of a school day. Similarly, the firm is often a
conglomerate, and family members commute and so come under the influence of
several communities.

Streuning (in Streuning & Guttentag, 1975) points to the 
importance, for evaluation of health services, of an ecological analysis. 
He proposes to divide a catchment area with a population of perhaps 
400,000 into 100 units, and to correlate characteristics of the unit 
with indicators of use of services. His plan calls for reducing a 
large set of predictor variables and a large set of dependent variables 
by cluster or factor analyses, followed by calculation of multiple-regression
equations relating dimensions from the two sets. The
coefficients in these equations would generate hypotheses about
how to increase use of services. Streuning's plan
probably represents, at its best, the state of the art in using existing
records to understand and improve a social program.

The question to consider here is whether the choice of unit matters. 
Streuning chose not to use a unit smaller than the census tract, presumably 
because many data available at that level are unavailable at lower levels. 
He chose not to use a larger unit because a sample size of 100 or more is 
recommended for inference from sample correlations. Streuning does not argue 
that the tract boundaries relate to any gradient of action. Some actions
(say, reducing the income level at which a certain service is given without
charge) are conditional on individual characteristics. Some (establishing
a hot-line phone for pregnancy counseling) are citywide. Streuning could
well have thought more about the choice of unit. The N of 100 units should 
not be a ruling consideration for Streuning. Insofar as he is seeking 
policies for this single catchment area he is dealing with a population, not
a sample. Insofar as he is seeking theoretical insight he is dealing with a
sample of size one (of many catchment areas in the nation). Although it may be
distressing not to have some data for units smaller than the census tract,
this is not an insuperable barrier to using smaller units; one can assign
the value for the tract, or a prorated value, to each of its subunits (say,
an apartment building).

Geographical areas can be divided coarsely or finely. The Yule- 
Kendall computations on wheat and potatoes (p. 2.12) show that correlations 
change with the unit of analysis. Some correlations
will change more than others. If so, both at the factor-analytic
stage where he reduces the predictor set and at the multiple-regression
stage, Streuning could expect to get different results as
he alters the unit of analysis.

As no areal unit can be seen as the unit of theory in
Streuning's case, it is uncertain what procedure to recommend. A
next step appears to be to collect data that are disaggregated to the 
greatest degree possible, and to apply the proposed methods of analysis 
across and within various alternative levels of aggregation. A serious 
problem for studies of human ecologies, once we leave the neat hierarchical
partitioning of schools, is how to bound "an ecology". Arbitrary
slicing of areas along the lines of large aggregate reporting units
defined without reference to the problem in hand seems certain to
misdirect thinking.

Notes for Section 1

In this report, treatment is a general term. It includes controlled and
administered instructional or therapeutic interventions, but it also
includes variations in services that sprang up without control (e.g.,
talkative teachers vs. listening teachers). Any service or activity or
policy that could in principle be installed deliberately is a treatment.
Although many examples in the first parts of this paper refer to treatment
contrasts, the theory to be developed considers relations within a
single treatment. It therefore applies not only to experiments but to
naturalistic studies (e.g., of utilization of educational TV).

Much of the discussion of units in sociology has been concerned with 
correlations between variables that are present simultaneously (e.g., 
ethnic and religious identifications). Some of my thoughts about asymmetric
relations of treatment to outcome may not fit these studies where
there is no manipulable variable. 

1.6 an appendix. Abt did report child-level significance tests, claiming
as many as 3800 d.f.

1.9 Lumsden (1976) takes vehement exception
to my advice regarding disattenuation. I should apologize for recommending
disattenuation and yet reporting attenuated results throughout this work.
The examples are all secondary analyses, and at best I could show the effect
of a guessed reliability coefficient on any results.

1.11 This is not true, of course, when groups are formed at random and
treated individually. Then any relation of X to any variable is nothing
more than a "composition" of the relations of X. When aggregation
is after the fact, and individuals within an aggregate have been treated
independently, the statement may or may not hold. Consider race as the basis for
grouping. Income within the race group may be an indicator of successful
performance; income as a between-group variable is heavily colored with
market inequity. The variable in the total pool is a mixture of the two
constructs.



1.12 Blalock (1964, p. 98) says that when relations of Y to the individual-level
X and to the aggregate X differ the
relation must have been altered by the entry of certain causal variables at
one level and not the other. I prefer to say that the individual X and the
aggregate X are distinct
variables. The properties of what the physicists call a critical mass
arise from the aggregate itself, not some "additional variable". The whole
in this case is more than the sum of the parts.

1.13 In the one trial of such a scheme that we have made, the three analyses
generated much the same standard deviation of residuals in a cross-validation,
though the regression equations did not weight the variables
in the same way.

1.13 Even then, research conducted in schools as now constituted is a poor
basis for forecasting what will happen when new assignment rules are
adopted.

1.19 From schools contrasted within districts one can generate a difference
score between treatments for each district. What appears to be a school-level
design is thus capable of being given a district-level analysis. A
policy decision that PC should or should not be adopted district-wide
hereafter, on the basis of a difference in this study, does require the
assumption that the effect in a PC school is the same whether or not comparable
schools in the district have PC. The choice of unit for assignment of treatment
thus rests on a theoretical proposition. Even the district may not be a
large enough unit for adequate evaluation of a policy. The Federal government
entertained the idea of encouraging district use of such contracts; but if the
designer of the evaluation could offer grounds for believing that payoff
would be greater when all the districts in a county went on the contracting
basis, then sampling scattered districts would not disclose important data
on the working of the policy.

1.19a There is a substantial literature in economics and econometrics that I 
have made no attempt to review. 

1.22 Analysis of elements within collectives is of course the basic method
in comparative studies that take one nation or one school at a time, 
and then repeat the study in another collective. I know of no 
instance in which such a comparative study has formally analyzed 
across collectives as well as within collectives. 

1.22 The dichotomous variable such as Black/white becomes $\pi$ = proportion
black for the class, and $1 - \pi$ or $-\pi$ for black or white
individual, respectively. Although the two components are linearly
independent, the variance of the deviation component is nonlinearly
related to $\pi$.



2. Units in various research contexts 
The problem as seen in research on Aptitude x Treatment interaction

I was brought to face the aggregation problem while R. E. Snow and I 
were completing a review of the numerous studies on Aptitude x Treatment 
interaction (ATI; Cronbach and Snow, 1976). The issue in such research is 
whether outcome-on-aptitude regressions (hereafter, I refer to Y-on-X
regressions) have the same slopes within the treatments (say, within
competing teaching methods). Findings on interaction might give the school
a basis for assigning a particular student to whatever mode or style of
instruction is likely to produce best results for him. The investigator
naturally approaches the problem with the psychologist's bias, asking how
individual characteristics relate to individual outcomes and hence
to choice of treatment for the individual. Conventional thinking
about aptitude effects on individuals is not fully applicable, Snow and I
now realize. A practically relevant conclusion ought to describe
the result to be expected under the usual school conditions. The
school teaches students in groups, even in much "individualized
instruction".

Sample size for regression analysis. Investigators conducting ATI
studies in classrooms have, with rare exceptions, pooled the data for
subjects within a treatment before analysis, ignoring the class grouping.
They have taken the individual student as the unit of analysis.

Even in a simple t-test on the outcome in a true experiment, a
calculation at the group level is less likely to reach significance than
a calculation at the individual level. Assuming uniform group size $n$,
the two $t$ values have approximately the ratio $n\eta_y$. Thus if $\eta_y$
has a reasonable value of 0.30 and $n = 10$, the individual $t$ is
3 times the group $t$. In regression analysis the lack of power is
even more serious. (See Section 7.)

Something like 100 degrees of freedom is required to reject the
hypothesis that a regression slope is zero when the actual slope is large
enough to be of considerable importance (say, a standardized slope of
0.4). (See Cronbach & Snow, Chap. 3, 4.) Consequently, 100
classes (!) must be observed to get a good fix on a between-classes
regression. Pooling classes and analyzing individual scores, the
investigator claims a large number of degrees of freedom; he is then
more likely to be able to report a significant difference between
regression slopes. Unfortunately, his significance levels are spurious
unless he makes strong assumptions. If classes are the unit of sampling,
the number of classes is the natural basis for statistical inference.

The strategy of ATI research (and of much other social and 
educational research) will have to be modified, once it is recognized 
that the costs of the usual strategy are nearly prohibitive. Experiments examining
the difference between group-level regressions will be uninformative, in the
sense that the prior and posterior probabilities of accepting the null
hypothesis are nearly equal in a study of reasonable size. Only with the
sample sizes attainable in survey research will one find it profitable to
assess the "significance" of group-level regressions.

The Maier-Jacobs study. One team investigating ATI did take classes as the
units of analysis. Maier and Jacobs (1966) carried out a year-long
experiment in many classrooms. Spanish was taught by programmed
instruction in 39 elementary-grade classes; 17 by an "orderly" and 22 by
a "scrambled" program. Maier and Jacobs analyzed the classroom means
on various pretests and outcome measures and reported, among other
conclusions, that the outcome means were similar in the two treatments.
The between-classes regression slope of attitude toward programmed
instruction (posttest) onto IQ was positive when the orderly program
was used. (The slope was small because the s.d. of attitudes was very
small in the metric used, but the correlation was 0.75.) The implication
is that duller classes liked the orderly program less than abler
classes. In the scrambled treatment there was an effect in the opposite
direction; programmed instruction received higher ratings in duller classes.




Another set of between-classes regressions used IQ as the predictor and
an achievement posttest as the outcome. Maier and Jacobs provided Snow
and me with statistics on those variables for all cases pooled, from which
we could calculate three sets of achievement-on-IQ slopes:

                                   Orderly    Scrambled

    Overall (individuals pooled)     0.53        0.62
    Between classes                  0.50        0.77
    Within classes (pooled)          0.55        0.52

The differences are not enormous, and no sensible comment about 
significance can be made with a limited number of classes per 
treatment. Evidently, abler classes pulled ahead of
duller classes, most strongly in the scrambled treatment.
Second, IQ differences within the class related to outcome similarly in each
treatment. This (like many other studies) denied the working assumption
of the programmed-instruction movement of the 1960's, that orderly
step-by-step progression of instruction would largely erase the effect
of IQ on learning.
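
The three rows of the table correspond to three ways of partitioning the same sums of squares and cross-products. As a minimal sketch, assuming made-up scores for three small classes (these are not the Maier-Jacobs data, and the function name `three_slopes` is hypothetical), the overall, between-classes, and pooled within-classes slopes might be computed as follows:

```python
from statistics import fmean

def three_slopes(classes):
    """Return (overall, between_classes, within_classes) Y-on-X slopes.

    `classes` is a list of classes; each class is a list of (x, y) pairs,
    one pair per student. Class means are weighted by class size.
    """
    xs = [x for c in classes for x, _ in c]
    ys = [y for c in classes for _, y in c]
    grand_x, grand_y = fmean(xs), fmean(ys)

    sxx_b = sxy_b = sxx_w = sxy_w = 0.0
    for c in classes:
        mean_x = fmean([x for x, _ in c])
        mean_y = fmean([y for _, y in c])
        # Between-classes part: class means around the grand means.
        sxx_b += len(c) * (mean_x - grand_x) ** 2
        sxy_b += len(c) * (mean_x - grand_x) * (mean_y - grand_y)
        # Within-classes part: students around their own class means.
        for x, y in c:
            sxx_w += (x - mean_x) ** 2
            sxy_w += (x - mean_x) * (y - mean_y)

    overall = (sxy_b + sxy_w) / (sxx_b + sxx_w)
    return overall, sxy_b / sxx_b, sxy_w / sxx_w

# Hypothetical data: three classes of two students each.
demo_classes = [
    [(1, 1.0), (3, 2.0)],
    [(4, 5.0), (6, 6.0)],
    [(7, 9.0), (9, 10.0)],
]
overall, between, within = three_slopes(demo_classes)
print(round(overall, 3), round(between, 3), round(within, 3))
# prints 1.214 1.333 0.5
```

In these invented data the overall slope falls between the other two, since the pooled sums of squares and products are simply the between and within parts added together.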

Harvard Project Physics. The evaluators of Harvard Project Physics,
an innovative high-school curriculum, likewise collected data in classrooms scattered
over the nation. In addition to individual scores on beginning-of-year
and end-of-year tests, the investigators had information on the
climate of each classroom, obtained by aggregating questionnaire
responses of students. The papers of this project sometimes reported
analyses at the individual level (cases from all classrooms being
pooled) and sometimes reported analyses of class means. The chief
report published to date (Welch & Walberg, 1972) analyzed at the
class level. The several analyses in this and earlier publications
cannot be directly compared because they used somewhat different variables
and statistical techniques. Interactive effects were reported.

The studies suffered from a number of faults common in research on
interactions at that time. Snow and I, after a critical look at the
methodology of the studies (Cronbach & Snow, 1976, Chap. 10), were
uncertain as to the dependability of the interactions reported. The
comparison of main effects, using classes as the unit of analysis,
was open to much less question.

Walberg (paper in preparation) has recently reflected on the
HPP experience. He comments that the research group received conflicting
advice from methodological experts as to the best way to handle the
mixture of individual and class data they had amassed. His paper (in its
current draft) goes on to list something like a dozen competing modes of
analysis, several of which were tried by himself and his colleagues
on one or another set of variables. His final paper will be of
obvious use to persons interested in this report; but it would be
inappropriate to discuss the draft here.

Only late in our work did Snow and I become aware that the interaction
phenomenon has to be defined substantively as a between-groups or
within-groups effect.
We came at last to see the importance of Wiley's view that response
to treatment is not simply an individual-level process.
Group characteristics (aggregate or global) may interact with treatments,
and they may interact in a different manner than individual characteristics.
Some pages on this theme were added to Chapter 4 of the Cronbach-Snow
book in the final stages of writing.

Three kinds of process. To recapture the argument of those pages,
it will suffice to consider the regressions of outcome Y onto aptitude $X_p$
(the score for the individual p) and onto the aggregate $X_c$, the
mean of this same aptitude over individuals in class c. At least three
kinds of causal phenomena may enter
into an observed interaction or an observed regression slope calculated
from individual-level data.

There is a sample of classes. These need not have been assembled 
at random, but the classes are divided at random between two treatments. 

A common outcome measure is obtained on all persons, with the 
following hypothetical results: 

a. The overall mean is the same in treatments A and B. 

b. The regression of outcome on a measure of initial 
ability or achievement is nearly flat — say, a slope of 0.2 — 
when calculated on all the persons in the A group. 

c. In the same metric, the individual-level 
regression slope is 0.6 in the B group (Figure 2.1). 
I.e., students with superior aptitude do considerably 
better in B than students of low aptitude, and consid- 
erably better than their high-aptitude counterparts in 
Treatment A. 

Three alternative explanations may be offered:

1. Interaction at the individual level. 

2. Interaction at the class level. 

3. Interactive effects within the class. 

A concrete example will help in what follows. In Treatment A (didactic), 
history students study immigration problems of the U.S. through textbooks 
and teacher exposition. Treatment B is inductive; the students examine 
original documents, newspapers, etc. and work out conclusions through 
discussion.

[Figure 2.1. Alternative ways of generating an overall regression slope of 0.6.
Each panel plots Y against X. Panels: (i) result observed in pooled data;
(ii) regression generated by individual effects, with between-class and
pooled-within-class slopes equal; (iii) regression generated wholly by
class-level effects, between-class slope greater than pooled-within-classes
slope; (iv) regression generated wholly by within-class effects,
between-class slope is zero.]

The three kinds of effects can be examined without reference to differences
in slope (interaction). The prior problem is to explain the within-treatment
regressions. How might the steep regression (all cases
pooled) in the inductive treatment have arisen?

Let us assume that the groups
differ at the outset of the study only with respect to X and irrelevant
variables; i.e., there are no specification errors. Also, to hold questions
of attenuation effects in abeyance, let us assume that X is perfectly measured.

(1) Individual level. Psychologists have regarded regressions
of outcome on aptitude
as manifestations of individual aptitude, working on lessons delivered to
the individual. If a Y-on-X regression is steep, the interpretation has
been that the person with a high score on the aptitude test somehow
processes material more efficiently or diligently than the low-aptitude
student. E.g., the fast reader has a considerable advantage
when numerous documents must be scanned and evaluated for
relevance. Interpreting the observed interaction as of type (1) implies
that the result would be found if students were exposed to the teaching
method individually. Panel (ii) is consistent with this interpretation.
The within-class regressions depart only by chance from the pooled regression.
Any particular configuration of
regressions could arise from combinations of two or three kinds of
effects, however.

The modern interest in ATI stems from a concern with individual
assignment in education. Cronbach and Gleser (1957) established
a rationale for validating such assignment rules. To justify
assigning students to alternative treatments (e.g., to regular and slow
sections), it was logically necessary to show that a steeper regression
existed in one treatment than in the other. Although the matter was never
discussed fully, Cronbach and Gleser assumed that data would be collected
by assigning students from the whole aptitude range to each treatment, just
as a selection test is validated on a sample from the whole range of applicants.
In suggesting that regressions observed in wide-range groups would guide
the formation of groups more homogeneous in aptitude, Cronbach and
Gleser implicitly assumed that the regression slope reflects
the response of the individual to the treatment. His expected
outcome in a given treatment was taken to be the same regardless
of the choice of persons to be treated alongside him.

(2) Group level. An alternative causal hypothesis is that the
level of aptitude in the class as a whole determines the effect of a
treatment. Would not a steep slope be found in Treatment B if the source
material selected for interpretation in abler classes (as identified by mean
aptitude) were much superior to that selected
for use by the dull groups? Such a mechanism, triggered by the class
average, perhaps serves both the abler and less able members of the
able class. Under this hypothesis the richness of the experience depends
on the environment, not on the abilities of the students working singly.
(See Panel iii.)

A similar group effect might be found if the teacher regulates the pace, 
forcing the discussion to a penetrating level in the able class and 
leaving it superficial in a dull class. 

(3) Comparative effects within the class. The third possibility
is that the regression slope is determined by effects within
the class. Suppose that, in classes using the inductive
method B, the ablest students within the class steal the show. They dominate
the discussion; they are rewarded for locating materials more rapidly than
others, and so are encouraged to redouble their efforts.
The duller members of the class, systematically outshone,
come to rely on their abler classmates to keep things going (Panel iv).
Another possibility is that the typical teacher will habitually interact
more with the superior members of the inductive class.

Effect (1) presumes that the outcome is a function of the student and 
the choice of treatment, not depending in any systematic way on the makeup 
of the class. Effects (2) and (3) presume that class makeup matters, that 
two students whose aptitude is at the population mean will achieve differently 
when one is superior to the classmates he draws and the other is assigned to 
a group abler than he is. In Panel (iii) a student gains by entering a class 
where he is below average; in Panel (iv) it is the other way around. 

The shallow slope in the didactic treatment A might be explained 
by the near-absence of all the three types of regression effect. But 
effects can balance each other. A shallow pooled slope in the didactic 
treatment may result if one effect is positive whereas the other two
are close to zero or one is negative. It is possible for a slope to
be negative (e.g., when a teacher concentrates effort on the least
able members of the class).

The difference of 0.4 in slopes in the pooled analysis
(= 0.6 - 0.2) can arise in principle from an interaction effect of
0.8 at the individual level, an effect of -0.4 (say, 0.2 - 0.6) at the
group level, and an effect of 0.0 at the within-group level. Figure 2.2
sketches two out of many possible configurations that yield pooled slopes of
0.2 in A and 0.6 in B. An interaction observed in pooled data from many
classes obviously cannot be directly interpreted.
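
To make the ambiguity concrete, the sketch below relies on a standard least-squares identity: the pooled slope equals the between-classes slope weighted by the between-class share of the sum of squares of X, plus the within-classes slope weighted by the remainder. Treating the intraclass correlation of 0.75 assumed in Figure 2.2 as that share is an assumption made here, and the slope values are invented for illustration; they are not read off the figure. The name `pooled_slope` is hypothetical.

```python
def pooled_slope(eta2_x, b_between, b_within):
    """Pooled Y-on-X slope as a variance-weighted blend of the two slopes.

    eta2_x is the fraction of the sum of squares of X lying between classes.
    """
    return eta2_x * b_between + (1.0 - eta2_x) * b_within

ETA2_X = 0.75  # assumed between-class share of the X sum of squares

# Configuration 1 (hypothetical): treatments differ only between classes.
config_1 = {"A": (0.2, 0.2), "B": (0.55 / 0.75, 0.2)}  # (between, within)
# Configuration 2 (hypothetical): treatments differ only within classes.
config_2 = {"A": (0.2, 0.2), "B": (0.2, 1.8)}

for name, config in (("config_1", config_1), ("config_2", config_2)):
    for treatment, (b_between, b_within) in config.items():
        b = pooled_slope(ETA2_X, b_between, b_within)
        print(name, treatment, round(b, 3))
```

Both configurations give pooled slopes of 0.2 in A and 0.6 in B, although the treatments differ at the class level in one and inside classes in the other.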

The problem I began with, then, was this: What analysis
comes closest to describing separately the interaction effects of the
three kinds? And what problems of interpreting the findings arise?

It has perhaps already become apparent to the reader that the problem
is not adequately formulated in the paragraphs above. Once a distinction
between effects at the individual and class levels is made, it is natural
to separate within-classes and between-classes
regression analyses. At best, this resolves into two components an effect
that has three possible sources. I see no way to disentangle
the effects in analyzing data from the usual designs.

Aggregation effects

The sociological and econometric literatures contain many papers
on what is usually referred to as aggregation bias.
Robinson (1950) set in motion the discussion of aggregation bias
in sociology by demonstrating the disparity between correlations
computed on individuals and correlations computed on aggregates. Later papers have
judged that Robinson was shortsighted
to emphasize correlations rather than regression coefficients, but
the same issues arise when regression coefficients are compared.
Although the literature triggered by Robinson's work is voluminous (see citations
in Dogan & Rokkan, 1969, and Hannan, 1971), thinking has not
moved steadily forward. Arguments have become more complex, but
consensus is lacking. As late as 1971 Hauser could say that "sociologists
have not yet fully exploited the insight it [Robinson's article]
provides into the interpretation of relationships at different levels
of aggregation." (His pp. 11-12.) He went on, in echo of Riley (see p.
1.14), to speak of a "misunderstanding" arising from the view that
effects at the group level are sociological in nature, and individual
effects psychological ("internal to the individual"). Firebaugh (1975)
rejects even today the complacent view of Scheuch (1966) that most of the
problems are "intellectually settled".

[Figure 2.2. Distinctive configurations generating the same overall ATI
(coefficient = 0.2 in Treatment A, 0.6 in Treatment B). Each panel plots Y
against X. Panels: (i) difference in slopes at the between-group level;
(ii) difference in slopes at the within-group level. The intraclass
correlation is assumed to be 0.75.]

The original concern for aggregation bias had to do with the
effect of arbitrary contiguous grouping. Yule and Kendall (1950, pp.
310-11) chose the example of a correlation between potato and wheat
yields per acre. It is necessary to use data at some aggregation larger than
the potato patch; in the records, counties were the smallest available
unit. It seems reasonable a priori to group neighboring
counties. In a series of correlations for 48 counties, 24 pairs, then
12 sets, and finally 6 regions, the correlation moved up from 0.22 to
0.76. Yule and Kendall properly concluded that a correlation is specific
to the unit chosen, but failed to consider deeper interpretations.
Some of their within-region covariance was part of the overall between-county
covariance. Aggregation redefines the question
investigated by moving certain information out of one variance
(covariance) and into another.
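
A small numerical sketch may help show how grouping contiguous units can raise a correlation. The yields below are invented (they are not the Yule-Kendall figures): each county's yield is a regional level shared with its neighbor plus a local swing that happens to run in opposite directions for the two crops. Averaging neighbors removes the local swing, so only the shared regional component remains.

```python
from statistics import fmean

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def pair_means(values):
    """Average each pair of neighboring units (0-1, 2-3, ...)."""
    return [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]

# Hypothetical yields for eight contiguous counties.
regional_level = [10, 10, 11, 11, 12, 12, 13, 13]  # shared by neighbors
local_swing = [1, -1, 1, -1, 1, -1, 1, -1]         # opposite for the two crops
wheat = [g + e for g, e in zip(regional_level, local_swing)]
potatoes = [g - e for g, e in zip(regional_level, local_swing)]

r_counties = correlation(wheat, potatoes)
r_pairs = correlation(pair_means(wheat), pair_means(potatoes))
print(round(r_counties, 3), round(r_pairs, 3))
# prints 0.111 1.0
```

The negative covariance inside each pair of counties lowers the county-level correlation; once the pairs are averaged, that covariance no longer enters the calculation.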

Harnqvist (1975, pp. 101-102) offers some educational examples with
non-arbitrary grouping. He correlated measures collected in the International
Study of Educational Achievement at the individual, school, and country levels.
The respective correlations of reading comprehension with a measure of school
satisfaction among 10-year olds were 0.19, 0.18, and -0.77, a change of
direction as well as size. He also shows that it is possible for aggregate
correlations to be smaller in absolute value than lower-level correlations.
The successive correlations for reading with science knowledge at age 14
were 0.60, 0.76, and 0.54.

Discussion in recent years has centered increasingly on causal
interpretation. A number of writers rejected Robinson's emphasis on one
correlation as an estimator of another, making the valid point that the two
correlations reflect different phenomena or processes. It is the failure
to arrive at a clear logic for interpreting the two that continues to plague
the field. Thus Patricia Kendall and Lazarsfeld (1955, p. 295) discussed data from
The American Soldier where a within-groups regression coefficient was positive
and a between-groups coefficient negative. Soldiers who had been
promoted gave more positive answers on a question about promotion chances
in the Army than men who had not been promoted. Ratings related positively
to individuals' actual promotion. This was true for military police and
also true within the Air Corps. But promotion rates were higher in the Air
Corps whereas the rating on promotion policy was higher in the MP's. That
is, the between-group regression of rating on mean actual promotions was
negative. Kendall and Lazarsfeld concluded that the group phenomenon
reflects shared experience and perceptions and is not just an aggregate
of individual data.

Oddly, they abandoned this caution in another instance. Soldiers who
chose their own assignments liked their jobs better than others did. Units
where choice was commonly allowed were most likely to be rated by their
members as good units. So the between-groups and within-groups slopes are
similar. Kendall and Lazarsfeld said that the individual relationship
"corroborates the result" from group data. This appears suspiciously circular.
If the results had disagreed (as they could have), would the writers not then
have insisted that the group and within-group data bore on different phenomena?

For the sociologists in the Columbia group, most processes (e.g., generating
a certain income level) occur to the person in a group context, and no
process truly operates on the individual in isolation. Coleman (1954) urges
contextual interpretations because otherwise sociology becomes no more than
an "aggregate psychology". It appears that one
can interpret an aggregate regression coefficient as causally similar to an
individual coefficient only by assuming that group
members developed their scores on the variables
independently, two members of a group sharing an experience no more often
than members of different groups. Moreover, even if Y scores of individuals
were generated independently, if groups were formed on the basis of any
variable that correlates with Y·X this will produce a difference between
the within- and between-groups slopes. (See Section 3.) There can be no
general warrant for substituting group data when individual relations are
of interest. Conversely, an analysis at the individual level describes a
composite of within-groups and between-groups effects that is easy to
misinterpret.

Ecological psychology

The context of human behavior is receiving increasing substantive 
attention in psychology. 

Roger Barker devoted a career to the study of behavioral settings, adopting
the Lewinian position that the situation into which the individual moves
does as much to shape behavior as the personality of the individual.
Barker's interest was in the microecology. Bronfenbrenner (1974, 1975, 1976)
is less concerned with immediate situations and more concerned with the
totality of the individual's environment. Though each individual's
cultural setting may be unique, from Bronfenbrenner's point of view
there are statistical similarities in the experiences of individuals
living in the same neighborhood or participating in the same community
culture. Neighborhood data are to be considered in the same light as
classroom data, and subjected to separate between-neighborhood and
within-neighborhood analyses.

Bronfenbrenner suggests that the effect of an experimental
intervention (e.g., a program for disadvantaged children) is likely
to be small unless it radically changes the ecology of its subjects.
Moreover, he questions the appropriateness of a strictly individual
psychology; insofar as the subjects are part of the same
interactive community, the community and its members may constitute a
single "subject". That is to say, a treatment
may have a substantial effect in one community
and a negligible effect or the opposite effect in another.
Perhaps the contrast depends upon characteristics that could have been
identified at the outset of the intervention, or perhaps on fortuitous
occurrences. Bronfenbrenner and C. R. Henderson (personal communication)
have embarked on a program of disentangling between-groups and within-groups
components of variance that apparently is probing more into technical,
statistical issues than I have.

Evaluative studies and school-effect studies

The importance of group effects is slowly becoming recognized in the 
literature on educational evaluation. As long ago as 1967, 
the Wiley-Bloom-Glaser debate took place at a conference on evaluation. 

In the same year Bock and Wiley (1967) argued that the best 
design for a comparative educational experiment is often to assign 
classrooms — not pupils — to treatments, at random within schools. In 
data they studied, the component for differences among pupils usually 
accounted for 20-30 per cent of the sampling variance of the outcome mean 
within a treatment, with pupils regarded as random. The remaining 70-80 
per cent of the sampling variance arose from schools and from classrooms 
within schools. In their data, classrooms
accounted for far more variance in arithmetic fundamentals than schools,
whereas schools (neighborhoods?) accounted for more variance in reading.

The Bock-Wiley paper is one of a series of "school-effect" studies that
ask how much variance in achievement, aspiration, or career level is
"attributable" to school differences. Another kind of school-effect study relates
specific characteristics of the school (verbal ability of teachers, say, or
mean sense of efficacy in the student body) to outcomes, as in the Coleman
report.

Werts (1968), among many others, reacted critically to the Coleman 
report. Coleman's analytic device was to partial the school means on family 
background and similar student characteristics out of the final achievement 
test scores of individuals. The residual school effect (percentage of 
variance in individual achievement accounted for) was then interpreted as 
an index of the impact of the school. The Coleman report left the impression 
that excellence of facilities and other supposedly valuable features of the 
school program had little or no correlation with competence of graduates. 
Good students tend to be found in schools that have good facilities, hence 
the variance due to treatment overlaps the contribution of student ability. 
Partialling out student differences at the first step arbitrarily assigns 
the overlapping variance to student characteristics and not to treatment. 
Werts advocated a partitioning due to McNemar which evaluates the unique 
contributions of student characteristics and school characteristics and 
leaves their overlap as a third fraction of the predictable variance in 
outcome. The distinction between the individual and group levels of 
analysis was touched on in several of the papers, but became a focus of atten- 
tion only recently. Coleman himself (1975) has now acknowledged the validity 
of the objection to partialling out family variables in estimating school
effects, but says that critics misperceived the purpose of his 
1966 analysis. 
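
The difference between partialling student characteristics out first and reporting the overlap separately can be made concrete. The following is a minimal sketch of a unique-plus-overlap split for two standardized predictors; the correlations are hypothetical, and treating this as the McNemar partitioning Werts had in mind is an assumption rather than something the text confirms.

```python
def commonality(r_ab, r_as, r_bs):
    """Split R^2 for outcome A on predictors B and S into three fractions.

    r_ab, r_as: correlations of A with background B and school variable S.
    r_bs: correlation between the two predictors.
    """
    r2_full = (r_ab**2 + r_as**2 - 2 * r_ab * r_as * r_bs) / (1 - r_bs**2)
    unique_b = r2_full - r_as**2   # gain from adding B after S
    unique_s = r2_full - r_ab**2   # gain from adding S after B
    common = r2_full - unique_b - unique_s
    return {"unique_B": unique_b, "unique_S": unique_s,
            "common": common, "R2": r2_full}


# Hypothetical correlations, chosen only for illustration.
parts = commonality(r_ab=0.5, r_as=0.4, r_bs=0.6)

# Entering B first credits it with unique_B + common and leaves S
# with unique_S alone.
b_entered_first = parts["unique_B"] + parts["common"]
```

Here B entered first is credited with 0.25 of the variance and S with under 0.02, although more than half of the predictable variance is overlap that the data cannot assign to either source. The overlap fraction can be negative when one predictor acts as a suppressor.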

Luecke and McGinn (1975) contrasted Coleman's method of analysis 
with another adopted by Project Talent. Among other conclusions, they 
report that Coleman's method systematically overestimates the effect on
achievement of family background (vis-à-vis effects of teacher and school
quality). Their procedure is to simulate the generation of student
achievement over five stages (years of schooling) by setting up a causal model
and specifying parameters of the causal variables including the correlations 
that represent causal links. Far more is at issue in their paper than the 
level of aggregation. They are concerned with "dynamic" effects in a
long-continued process, and with the fact that students change classes and
schools. Scoring the quality of the student's own 
teacher instead of using the aggregate quality of teachers in 
his school also modifies conclusions. I find myself discontented with the 
Luecke-McGinn presentation, particularly in their use of certain zero-order 
correlations (e.g., family with first-year achievement) as the standard 
index of influence against which other analyses are judged. The study does
illustrate the potential of simulations for forcing social scientists to 
recognize the consequences of their analytic decisions. Simulations may have 
important uses in later work on aggregation per se.

Duncan (e.g., 1970) has been especially insistent that any plan of
analysis rests on a particular causal model. Pedhazur (1975) has recently
applied similar thinking to the effect of school-level variables on
achievement. Figure 2.3 is a modified version of a figure on his page 247.
I use circles for global school-level variables and squares for
individual-level variables (which may be aggregated). Otherwise, the diagrams
follow the conventions of structural models. B represents background factors
such as parental income; S represents a school-quality variable such as
per-pupil expenditure or quality of teachers; and A represents student
achievement. U and V are unspecified causes or sources of error. The B-to-S
relation in (i) implies that neither variable causes the other; the relation
must arise from some prior cause. In each other diagram, B contributes to S,
perhaps via parents' willingness to vote more money for the schools.
Achievement is said to depend directly on S in (ii), and directly on B in 
(iv); in (i) and (iii) the effect of B and S is joint. Pedhazur has much 
to say about alternative ways of partitioning variance and of testing the 
adequacy of the several models, but he does not emphasize units as such. 
(Later in his article, he does identify the main sources in the
units-of-analysis literature but adds little of his own.)

If (iv) is the model, analysis is wholly at the individual level with
B as the sole predictor. Adding S to the regression equation (with the
same value for every student in the school) would falsely reduce the
apparent contribution of B to A, and would imply that A depends on S. If
this model does apply, the partial covariance $s_{AS \cdot B}$ will depart
from zero only by chance.
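
The claim about the partial covariance under model (iv) can be checked by simulation. This is a sketch under stated assumptions: the path values 0.6 and 0.7, the normal errors, and the sample size are all hypothetical, and S is generated per individual, so the grouping of students into schools is ignored.

```python
import random


def cov(x, y):
    """Descriptive covariance of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def partial_cov(a, s, b):
    """Covariance of a and s after removing the linear effect of b from both."""
    return cov(a, s) - cov(a, b) * cov(s, b) / cov(b, b)


rng = random.Random(0)
n = 20000
B = [rng.gauss(0, 1) for _ in range(n)]
S = [0.6 * b + rng.gauss(0, 1) for b in B]   # B contributes to S
A = [0.7 * b + rng.gauss(0, 1) for b in B]   # A depends on B only, as in (iv)

s_AS = cov(A, S)               # clearly nonzero, because B is a common cause
s_AS_B = partial_cov(A, S, B)  # near zero apart from sampling error
```

The zero-order covariance of A with S is close to 0.6 x 0.7 = 0.42 even though S has no effect on A, while the partial covariance stays near zero. An analyst who enters S without a causal model would see the first number and could easily misread it.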

If (ii) is the accepted model, analysis can appropriately be
carried out at the group (school) level, with S as the sole predictor
of the aggregate A. There would be no reason to supplement this with
an analysis of the B-to-A relation within groups if one believes the
model's assertion that the partial covariance $\sigma_{AB \cdot S}$ is zero.

If (i) is the accepted model, partitioning of the variance is open 
to the ambiguities discussed by Werts. The analysis can be done at 
the group level, with the aggregates B and A entered along with S. A 
supplementary individual-within-groups analysis of the B-to-A relation
can be made. 

Model (iii) is analyzed much as (i) is, but the interpretation 
can be less equivocal. The predictable variance of the aggregate A 
is allocated to two sources: B-independent-of-S, and S. The S portion 
includes some indirect or joint influence of B, but this is not seen 
as an influence "of B on A"; it is an influence mediated by school
quality. Here again, the model makes it reasonable to consider a 
B-to-A-within-S effect at the individual level. 

Most educational evaluations have been analyzed with individuals 
as the unit of analysis, even when using aggregate variables as in the 
Luecke-McGinn simulation. An example among studies of real data is 
the report on Head Start Planned Variation (Featherstone, 1973) in 
which the number of individual children receiving each treatment 
variation was taken as the sample size for it. The Performance
Contracting experiment, on the other hand, used the average for a single
school as a data point. Analysis at the classroom level is a third
possibility, as in the Maier-Jacobs and HPP studies. In principle, the
evaluator could recognize hierarchical nesting, but apparently no
evaluation report has done this.

Pupils are nested within classrooms, classrooms are nested within 
schools, and schools are nested within districts. It is reasonable to expect 
that an innovation mandated from the top of the hierarchy (let us say,
by the State Department of Education) will not trickle down uniformly
to the pupils. Possibly districts or communities will have a strong mediating
influence, in causing the innovation to work or in sabotaging it; desegregation
again comes to mind as an example. Innovations also succeed or fail at the
level of the school, in the sense that strong school leadership can produce 
results whereas passive compliance wipes out the effect. Within the school, 
individual teachers conform or fail to conform to the treatment
specifications, and add variation by the manner in which they carry out the treatment.
And finally, of course, one expects individual differences within a classroom. 

If one confines attention to the mean outcome in a treatment, 
nothing can be learned from a hierarchical breakdown; properly weighted, 
a mean is a mean is a mean. It is in the variances and regression 
coefficients at the several levels that differences could appear.
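
The remark that a properly weighted mean is unaffected by the hierarchical breakdown can be verified in a few lines. The class rosters below are invented for illustration.

```python
# Hypothetical outcome scores for three classes of unequal size.
classes = {
    "class_a": [52, 61, 58],
    "class_b": [70, 66, 75, 69, 72],
    "class_c": [48, 55],
}

all_scores = [y for scores in classes.values() for y in scores]
grand_mean = sum(all_scores) / len(all_scores)

class_means = {c: sum(s) / len(s) for c, s in classes.items()}

# Weighting each class mean by its enrollment recovers the pupil-level mean.
weighted = sum(len(classes[c]) * m for c, m in class_means.items()) / len(all_scores)

# An unweighted mean of class means answers a different question.
unweighted = sum(class_means.values()) / len(class_means)
```

The weighted mean of class means equals the mean over pupils (62.6 here), whereas the unweighted mean of class means does not. The levels disagree only once variances and slopes are examined.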

Extrapolation in interpretation

To conceive of an interaction "at the individual level" when treatments
have been applied to individuals within groups is to engage in treacherous
extrapolation. Data collected on groups are being explained in terms of
processes within the individual.
Such an interpretation is conventional in educational research and not 
unheard-of in general psychology, but it is highly questionable. 

A treatment may be significantly altered by the very fact that it is 
administered to the subject when he is in company with other subjects. In 
the example above, the inductive procedure for teaching history would be
radically altered by applying it to students
individually, as that would allow no way to retain the important feature of 
group discussion. 

The "operational definition" of the treatment consists essentially of a
set of instructions directing the acts of the experimenter or teacher. When
this identical operation is shifted from a group context to an individual
context, the treatment is likely to be significantly altered. The teacher's
reprimand, or instruction to "Pay attention to . . . ,"
is a different stimulus when addressed to the group in general than when
addressed to the student in isolation. Thanks to social facilitation, doing 
a page of arithmetic drill alongside one's classmates is not the same task 
as doing it alone. Thus the operational definition has to specify 
individual administration or group administration — more than that, 
it has to specify the basis for constituting groups. 

As is well known, evidence collected from applying one operationally
defined treatment may indicate little or nothing about what will happen
when a treatment with another operational definition is administered.
Sometimes the change in the
operation (here, the change made by altering the context) makes a large
difference, and sometimes it makes none. This has to be tested directly;
experiments with one operation do not give direct evidence on another. One
may reason indirectly if he has established a strong presumption that the
change in operation never matters. Bergmann (in Frank, 1961, p. [illegible])
uses the example of the location of the apparatus in an experimental room.
The presumption that a shift in location makes no difference is so strong, 
Bergmann says, that we are willing to ignore a shift in that aspect of the 
specification. Bergmann is right in the abstract, but his example is 
telling in a way he did not intend. 

Gerald Holton tells me of the experience of
Fermi and his group when they first attempted to bombard the nucleus 
with neutrons, in an attempt to create artificial radioactivity. They 
got negative results when the apparatus was set up on one bench in the 
laboratory, and got success on another bench. The critical difference was
a marble surface on the first bench. Neutrons rebounding from the 
surface had no more effect than those directly fired at the target 
nucleus. The second table, with a wooden surface, slowed the neutrons 
while scattering them, and it was these rebounding slow neutrons that 
produced the effect Fermi was seeking. Holton tells a similar story
of Rutherford, discovering thorium. Rutherford's electroscope dis- 
charged when far from the open door of his laboratory because of 
radioactive gases in the air. When he had collected data with the 
apparatus near the door there was no discharge; the gases were swept 
away by currents near the open door. A presumption that a certain 
shift in operation has no effect, then, is a presumption made
at considerable risk. 

Group contexts surely affect human behavior at times. Hence evidence
collected by observing individuals behaving in groups is not a dependable
indication of what will happen in an individual experiment. Nor can evi- 
dence obtained in groups composed in one manner indicate what will happen 
when the groups are formed by a different procedure, unless a strong theory 
about the character of the context effects has already been worked out. 

The group-level and within-group effects are observed in a sample
of classes. These classes can be regarded as drawn from a population of
classes formed by a certain process. The results can be generalized to
that population of classes, i.e., to classes formed by the same process
from a similar pool of persons. The inference is of a type commonplace
in statistics. To make an inference to classes formed by some other
process or rule is just as much a leap in the dark as it would be to 
extrapolate from the treatment observed to some variant of it. 

The experimenter may or may not know what process formed the
classes he observes. If he formed them by randomly grouping members
of a student body or by another formal assignment rule, he can
generate similar groups by that same process so long as the population of
students is unchanged from year to year. If the groups were formed
by the existing community and within-school processes, his findings
will apply so long as the school population is stable and those
processes continue to control class membership. He cannot assume
that the findings will apply if a new grouping procedure is installed
following the experiment. Altering nothing but the size of the
instructional groups might be enough to change the relation of interest.

This line of argument can lead in two directions. (1) To be
conservative, the person conducting an experiment in intact classes 
will limit his conclusion to classes formed in the same manner. He 
should specify the process that formed the classes or the characteristics 
of the classes, in such a way that others making use of his research 
can judge whether their classes resemble his in composition. This 
extends the usual recommendation regarding description of a subject 
population. In classroom experiments, the class is the subject and 
the characteristics of the subject classes should be brought into the 
open. (2) A liberalizing step is to regard the assembly rules as 
treatment dimensions. In the course of a long program of work, 
particularly work oriented toward theory, an investigator varies
the specifications for an experimental treatment. The successive
variants form a collection that can be described by parameters, and 
the varying effects can be described as a function of the parameters. 
When this process is well advanced, the investigator can make reason- 
able predictions about treatments that have not been roadtested. 
Just as a collection of manipulated treatments has parameters, a 
population of classes has parameters. Applying different sorting 
processes in successive experiments would build up some theory. 
One might be able then to make limited extrapolations to classes 
formed in ways other than those directly observed. 



3. A mathematical model 

The regression equation needed for distinguishing the effects of
interest can be built up in steps. Confine attention to a single treatment,
and identify persons p as members of groups c. The person has
scores $X_p$ and $Y_p$, which may for emphasis be written $X_{pc}$ and
$Y_{pc}$. The model could be set up in terms of $\mu_{X_p}$ and $\mu_{Y_p}$,
the "true" or "universe scores" that would hypothetically be obtained by
exhaustive measurement. In this section the model is in observed-score form.
Universe scores will be taken up in Section 6. The class mean $\mu_{Y_c}$
is the mean over the fixed class c of $Y_{pc}$ ($p \in c$); likewise for X.
I should note also that, so long as questions of statistical inference are
held in abeyance, the model applies to dichotomous variables as well as to
continuous ones.

This section can be read as if all collectives have the same number
of members. When the number is variable, the definition of any parameter
involves a weighting decision. See Section 4.

Definition of components

The Y score may be divided into general-level, between-group, and
within-group components in the usual manner:

(3.1)  $Y_{pc} = \mu_Y + (\mu_{Y_c} - \mu_Y) + (Y_{pc} - \mu_{Y_c})$

The between-groups component divides into a part predicted by the group mean
on X and a residual:

(3.2)  $(\mu_{Y_c} - \mu_Y) = \beta_b (\mu_{X_c} - \mu_X) + \gamma_c$

It is to be noted that the same regression coefficient serves to predict
$Y_{pc}$ from $\mu_{X_c}$.

The within-groups effect is decomposed in two stages. Write $\beta_w$ for
the common within-groups regression coefficient that best accounts for the
sum of squares within groups. Then

(3.3)  $(Y_{pc} - \mu_{Y_c}) = \beta_w (X_{pc} - \mu_{X_c}) + \delta_{pc}$

But within a particular group the regression slope $\beta_c$ need not equal
$\beta_w$, which leads to the further decomposition:

(3.4)  $\delta_{pc} = (\beta_c - \beta_w)(X_{pc} - \mu_{X_c}) + \varepsilon_{pc}$

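Putting the equations together gives this series of components:

(3.5)  $Y_{pc} - \mu_Y$
         $= \beta_b (\mu_{X_c} - \mu_X)$                   Between, predicted
         $+ \gamma_c$                                       Group residual
         $+ \beta_w (X_{pc} - \mu_{X_c})$                  Common within, predicted
         $+ (\beta_c - \beta_w)(X_{pc} - \mu_{X_c})$       Specific within, predicted
         $+ \varepsilon_{pc}$                               Person residual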

The overall slope $\beta_t$ considered by those who analyze at the individual
level is a composite of $\beta_b$ and $\beta_w$. As shown by Duncan, Cuzzort,
and Duncan (1961, p. 66):

       $\beta_t = \eta_X^2 \beta_b + (1 - \eta_X^2) \beta_w$

where $\eta_X^2$ is the intraclass correlation of X [equal to
$\sigma^2(\mu_{X_c}) / \sigma^2(X_{pc})$].

Coupled with the argument on pp. 4.3-4, this formula has an important
implication for those who try to interpret $\beta_t$ in typical educational
studies. $\beta_t$ is a weighted average. In studies with a modest number of
groups, $\beta_b$ is badly estimated, though $\beta_w$ may be well estimated.
The larger the value of $\eta_X^2$, the more the error in $\beta_b$ makes for
errors in $\beta_t$. (The investigator usually has the illusion that numerous
cases entered into the latter calculation.)
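
The statement that the individual-level slope is a weighted average of the between-groups and pooled within-groups slopes, with the between-groups share of the X variance as the weight, holds exactly for sample sums of squares and products. The sketch below checks it on invented scores for three groups; the helper name `sums` and all data values are hypothetical.

```python
def sums(xs, ys):
    """Return (mean_x, mean_y, SS_x, SP_xy) for paired scores."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    ss = sum((x - mx) ** 2 for x in xs)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, ss, sp


# Hypothetical (X, Y) scores for three groups.
groups = [
    ([1, 2, 3, 4], [2, 3, 3, 5]),
    ([3, 4, 5, 6], [6, 6, 8, 9]),
    ([6, 7, 8, 9], [7, 9, 9, 12]),
]

all_x = [x for xs, _ in groups for x in xs]
all_y = [y for _, ys in groups for y in ys]
gx, gy, ss_total, sp_total = sums(all_x, all_y)

ss_within = sp_within = ss_between = sp_between = 0.0
for xs, ys in groups:
    mx, my, ss, sp = sums(xs, ys)
    ss_within += ss
    sp_within += sp
    ss_between += len(xs) * (mx - gx) ** 2
    sp_between += len(xs) * (mx - gx) * (my - gy)

b_t = sp_total / ss_total        # individual-level slope, groups ignored
b_b = sp_between / ss_between    # slope of group means on group means
b_w = sp_within / ss_within      # pooled within-groups slope
eta2_x = ss_between / ss_total   # share of X variance lying between groups

composite = eta2_x * b_b + (1 - eta2_x) * b_w
```

Here `b_b` rests on only three group means, yet it enters `b_t` with weight `eta2_x`. The larger that weight, the more the instability of the between-groups slope is passed on to the individual-level slope, whatever the number of pupils.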

Putting $\mu_Y$ on the left side of (3.5), I leave it unanalyzed.
In a two-treatment study, $\mu_Y$ includes the treatment effect as well as
the general mean. In the two-treatment study of the usual type, groups are
nested within treatments and the treatment constitutes a third level in the
hierarchy.

The model (3.5) could be defined with $E\beta_c$, the average of the
within-group slopes, replacing $\beta_w$. In general this change decreases
the "common within, predicted" variance and increases the specific-within
variance. The definition in terms of $\beta_w$ is more conventional, often
being associated with an assumption that group membership is random and the
Y, X distributions within classes homogeneous.

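The distinction has little importance for descriptive statistics. For
some sets of data we have calculated $b_w$ and the mean $b_c$ and found that
the two differed negligibly. The expected value

       $E\beta_c = E_c \left[ \sum_{p \in c} (X_{pc} - \mu_{X_c})(Y_{pc} - \mu_{Y_c}) \Big/ \sum_{p \in c} (X_{pc} - \mu_{X_c})^2 \right]$

does not generally equal

       $\beta_w = E_c \sum_{p \in c} (X_{pc} - \mu_{X_c})(Y_{pc} - \mu_{Y_c}) \Big/ E_c \sum_{p \in c} (X_{pc} - \mu_{X_c})^2$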
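
The difference between the average of the group slopes and the pooled within-groups slope is easiest to see when groups differ in their spread on X. The two classes below are invented to make the contrast extreme; real data, as noted above, often show a negligible difference.

```python
def slope_parts(xs, ys):
    """Return (SP_xy, SS_x) computed about the group's own means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss = sum((x - mx) ** 2 for x in xs)
    return sp, ss


# Two hypothetical classes with very different spreads on X.
groups = [
    ([0, 1, 2], [0, 1, 2]),   # slope 1, small spread on X
    ([0, 4, 8], [3, 3, 3]),   # slope 0, large spread on X
]

parts = [slope_parts(xs, ys) for xs, ys in groups]
b_c = [sp / ss for sp, ss in parts]

# Average of the group slopes: every group counts equally.
mean_of_slopes = sum(b_c) / len(b_c)

# Pooled slope: ratio of summed products to summed squares, so groups
# with a wide range on X dominate.
pooled_slope = sum(sp for sp, _ in parts) / sum(ss for _, ss in parts)
```

The mean of the two slopes is 0.5, while the pooled slope is 2/34, about 0.06, because the pooled coefficient weights each group's slope by its sum of squares on X.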

If more than one X is available for every subject, weighted composites
will account for more variance, between and within groups, than the single X.
The model can be extended by introducing, for example, $\beta_{b1}$ and
$\beta_{b2}$. It is comparatively difficult to think about the multivariate
case, however; the best composite predictor between groups may differ from
the best within-groups composite, and each particular group may have its own
best composite. The usual intuitive understanding of multivariate
relationships is confounded by the fact that predictors whose between-groups
correlation equals zero may have a nonzero within-groups correlation, or vice
versa. Consequently, any geometric analogy (e.g., reference to "dimensions")
is likely to go awry. I shall return to the two-predictor problem.

A simple structural model will perhaps be helpful (Figure 3.1).
The grouping rule determines the division of X between the class mean and
the deviation score. These two are uncorrelated. Analysis may then proceed
separately in the upper and lower tracks. The conventional "individual-level"
analysis can be thought of as proceeding in the same manner, save that
$\beta_b$ and $\beta_w$ are constrained to have the same value.

Figure 3.1. Structural model for hierarchical analysis.

Figure 3.2. Structural regression based on data of Campbell and Alexander.
Numerals show unstandardized regression coefficients; numerals in parentheses
show standard deviations for selected components.

The concrete example in Figure 3.2 is derived from a figure of Duncan
et al., 1972, p. 193; c here symbolizes a school, not a class. The original
data were supplied by E. Q. Campbell and C. N. Alexander, Jr. (N = 1137.)
Duncan et al. give correlations; I assume $s_X = s_Y = 1$ to get regression
coefficients. Also, I assume linearity (i.e., that [expression illegible]).
The intraclass correlations for X and Y
are 0.20 and 0.11, respectively, indicating that more of the variance lies
between schools in the independent variable than in the dependent variable.
This on its face suggests that schools do not cause divergence. Such an
inference would be stronger if it were believed that SES is a sufficient
specification of the precursors of educational aspirations established at
the time students entered school (including any preliminary statement of
aspirations). The regression coefficient, however, is higher between
schools than within, which on its face argues for a tendency of high SES
schools to cause aspirations to rise. (For more on this kind of reasoning,
see p. 3.18.) The regression coefficients here are consistent with the
difference in correlations noted by Duncan et al. (0.86 between and 0.43
within); this would not always be the case.
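
The step from the quoted correlations to regression coefficients can be retraced as a short calculation. This is a sketch under stated assumptions: total variances of X and Y are set to 1 as in the text, and the quoted intraclass correlations are treated as the between-school shares of variance. The resulting slopes are derived here from those four numbers and are not read from Figure 3.2.

```python
from math import sqrt

# Values quoted in the text; total variances of X and Y are set to 1.
icc_x, icc_y = 0.20, 0.11          # share of variance lying between schools
r_between, r_within = 0.86, 0.43   # correlations between and within schools

# slope = correlation * (SD of the Y component / SD of the X component)
b_between = r_between * sqrt(icc_y) / sqrt(icc_x)
b_within = r_within * sqrt(1 - icc_y) / sqrt(1 - icc_x)

# Individual-level slope implied by weighting the two slopes by the
# between-school and within-school shares of the X variance.
b_total = icc_x * b_between + (1 - icc_x) * b_within
```

Under these assumptions the between-schools slope is about 0.64 and the within-schools slope about 0.45, so the slopes order the same way as the correlations. The ratio of slopes (about 1.4) is smaller than the ratio of correlations (2.0) because the between-school variance in Y is proportionally smaller than that in X.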

Interpretation of components. To give a further sense of the implications
of the five basic components, Figure 3.3 displays regression lines like those
in Figure 1.1. Regression lines for two groups are represented in Panels
(i)-(iii). (The lengths of the lines have no significance.) With two groups,
the group means fall on the between-groups regression line and $\gamma_c$ is
necessarily 0. In Panel (iv), four groups are shown. One of the $\gamma$
components is labelled. In Panel (i), the $\varepsilon$ component for a single
member of Group 2 is identified.

Figure 3.3. Possible relations among $\beta_b$, $\beta_w$, and $\beta_c$.
[Panels (i)-(iv), Y plotted against X. Legend: between-groups regression;
line through mean having the pooled-within-groups slope; specific
within-group regression.]

Regarding the residuals: The residual $\varepsilon_{pc}$ includes any effects
of individual characteristics p brought to the experiment that were not fully
represented in X, and errors that caused his observed Y to differ from his
universe score on Y. (As I have stated the model, the average error over
members of his class is added into $\mu_{Y_c}$ and removed from
$\varepsilon_{pc}$.) Furthermore, it includes any effect, unpredictable from
his X score and not common to other group members, that influenced his final
level of accomplishment (an illness, for example).

The residual $\gamma_c$ can be thought of as an adjusted "group effect."
Where the group is a class, $\gamma_c$ includes the "main effect" of the teacher,
plus the effects of variations in the delivery of the treatment to this class,
plus unpredicted effects that have a net influence on the mean (uncommon
enthusiasm, an epidemic, etc.), and the average error of measurement in the
$Y_{pc}$. The individual-level errors need not have an average close to zero.

The between-groups effect described by $\beta_b$ reflects any
consistent tendency of higher-X groups to do better than others (or
worse) on the outcome measure. An example already mentioned is the possibility
that teachers cover more ground in abler classes.

The common-within effect reflects the tendency for students above
the group average to outperform (or underperform) the rest of the group.
The regression coefficient is derived from data on all groups combined;
it may not fit any one group. The educator's usual
interpretation of the effect of aptitude on outcome is that students above
average do better, regardless of their classmates. That amounts to a
prediction that $\beta_w$ will be positive. If students are assembled into
groups on the basis of X information alone, and working in the groups
makes no difference, $\beta_b$ will equal $\beta_w$, as suggested in Panel (i).
But the inference cannot be reversed. The fact that $\beta_b = \beta_w$ does
not identify the causal situation (see p. 3.1[?]).

The specific within-group regressions are likely to vary, but it is hard
to know whether to take this variation seriously. The regression coefficients
will differ by chance even when the processes operating in the classes are
basically the same. Second, insofar as the selection factors operating to form
the several groups differ, the slopes will be affected. Third, and most
interesting, are the possible differences in causal processes. Slope
differences might come about, for example, if one teacher distributes
attention to high-X and low-X students in a different
proportion than the next does, or if some teachers set up a strong competition
that encourages the able and discourages the others.

The configuration in Panel (ii) suggests that instruction in
high-X groups differs little in average effect from that in low-X
groups. Within groups, however, the student's X level makes a substantial
difference. Apparently, the treatment has set up
a scarcity economy within the classroom, so that the comparatively able
students snatch up the educationally useful experience at the expense of
the comparatively weak students. If these are the results, a student near
the average of the overall X distribution is much better off in a class
that, as a class, is below average on X. The student with high X would
accomplish far more in a wide-range class, but such a class can be constituted
only by bringing in low-ability students, who are sacrificed in his interest. The
students with low X scores accomplish more if placed in a homogeneous low-X group.
A grouping policy is derived only by extrapolation, however. Would a
strictly homogeneous group of students with uniform low scores on X fall
on the between-groups regression line found in this experiment on
heterogeneous groups?

Panel (iii) shows $\beta_w$ less than $\beta_b$.
Here, the student's final outcome depends a great deal on the
level of the class, and little on his comparative standing within the
aptitude distribution of the class. Perhaps such a configuration
describes what would be found if the graduates of a prestigious
medical or business school and a run-of-the-nation school were assessed.
The highly selective school probably offers a more intensive program of
training. Once the program is adapted to the level of the group, it may be
that factors other than aptitude X account for differences in success within
either group. Thus X might have been a valid predictor of success before
the school programs diverged, and not after.

The slopes representing $\beta_c - \beta_w$ have various configurations.
In Panel (i), $\beta_c - \beta_w$ is negative for one class. Something
happening in this particular class negated the advantage abler students
have within typical classes, represented by the slope $\beta_w$.
Tedious instruction might have this effect.

It will be useful to recapitulate much that has been said by reproducing
Figure 3.1 in the form of Figure 3.4, with labels attached to the causal
connections. The labels are illustrative and not exhaustive; chance effects
also enter the residuals. I have not separated the specific and
common-within effects here.

Figure 3.4. Interpretation of causal arrows in Figure 3.1

What is easily overlooked is that Grouping Rule is a causal factor.
Change the rule, and all the regression coefficients would change. This
perhaps says no more than that any coefficient applies only to a certain
population of groups (Section 4); but the present formulation, along with
what is to come at page 3.11, emphasizes that the rule for assembling groups
is often a manipulable, causal variable. The intraclass coefficient
$\eta_X^2$ for X is the regression coefficient that goes with the causal
arrow leading to $\mu_{X_c}$, and $1 - \eta_X^2$ is the coefficient leading
to the deviation score.



Partitioning variance 

The model leads directly to a partitioning of variance. The overall
variance of Y divides into between-groups and within-groups portions, and
each of these subdivides. If one assigns to each individual three
predictor scores ($\mu_{X_c}$, $X_{pc}$, and c, where the last is a string
of coded dummy variables), a stepwise regression analysis will decompose the
variance of Y. Predictors enter in the order shown at left, and the change in
the mean square for regression at each step estimates a variance:

Predictor       Variance estimated                       Symbol

$\mu_{X_c}$     Variance attributable to the             $\beta_b^2 \sigma^2(\mu_{X_c})$
                between-groups regression effect
c               Remainder of between-groups variance     $\sigma^2(\gamma_c)$
$X_{pc}$        Variance accounted for by common         $\beta_w^2 \sigma^2(X_{pc} - \mu_{X_c})$
                within-groups regression
c × $X_{pc}$    Additional variance accounted for by     $\mathrm{Av}_c[(\beta_c - \beta_w)^2 \sigma^2(X_{pc} - \mu_{X_c})]$
                specific within-group regression
                (Group × Aptitude interaction)
(Remainder)     Unpredicted individual variance          $\sigma^2(\varepsilon_{pc})$

These measures indicate the potency of the five components to produce
differences in Y. The variances for $X_{pc} - \mu_{X_c}$ and
$\varepsilon_{pc}$ are defined over all groups pooled.
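
The five-way partition can be illustrated with descriptive sums of squares, which add up exactly to the total sum of squares of Y. The sketch below uses invented scores and works from sums of squares and products rather than from an actual stepwise regression routine. It makes no degrees-of-freedom adjustment, so it shows the accounting identity, not the mean-square estimates mentioned above.

```python
def sums(xs, ys):
    """Return (mean_x, mean_y, SS_x, SS_y, SP_xy) for paired scores."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, ssx, ssy, sp


# Hypothetical (X, Y) scores for three groups.
groups = [
    ([1, 2, 3, 4], [2, 3, 3, 5]),
    ([3, 4, 5, 6], [6, 6, 8, 9]),
    ([6, 7, 8, 9], [7, 9, 9, 12]),
]

all_x = [x for xs, _ in groups for x in xs]
all_y = [y for _, ys in groups for y in ys]
gx, gy, _, ss_y_total, _ = sums(all_x, all_y)

bx = by = bxy = wx = wy = wxy = specific_fit = 0.0
for xs, ys in groups:
    mx, my, ssx, ssy, sp = sums(xs, ys)
    n = len(xs)
    bx += n * (mx - gx) ** 2
    by += n * (my - gy) ** 2
    bxy += n * (mx - gx) * (my - gy)
    wx += ssx
    wy += ssy
    wxy += sp
    specific_fit += sp ** 2 / ssx   # SS fitted by this group's own slope

between_predicted = bxy ** 2 / bx   # fitted by the between-groups slope
common_within = wxy ** 2 / wx       # fitted by the pooled within slope

components = {
    "between, predicted": between_predicted,
    "group residual": by - between_predicted,
    "common within, predicted": common_within,
    "specific within, predicted": specific_fit - common_within,
    "person residual": wy - specific_fit,
}
```

Dividing each entry by the number of persons gives descriptive variance shares. The specific-within entry cannot be negative, because fitting a separate slope in each group can never fit worse than the single pooled slope.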

Choices made in forming the model

In setting up any model one chooses between alternatives. 

Direction of decomposition. I have chosen to partition group effects
out of the Y scores before evaluating within-group relationships. The
stepwise order shown above is just one of the possible orders (discussed,
e.g., by Werts, Duncan, or Pedhazur). The variance that might be attributed
to the overlap of group and individual characteristics is assigned here to
the between-groups component. Insofar as this is an arbitrary choice, not
justified by a causal model, there is ambiguity in interpreting
$\beta_b - \beta_w$.

It would have been possible to set up the model in just the opposite 
way, fitting a regression line to individual scores without regard to groups 
and then asking if group membership accounts for additional variance and 
covariance (see p. 1.17b, 1.22). This is a substantive decision. A considerable 
amount of variance in instructional practice occurs at the group 
level. It seems to me that individual differences are best 
studied by comparing persons treated under the same circumstances.
Members of the group are, in a sense, in the same circumstances.
In multilevel hierarchies, effects could be removed in
many orders; only a theory about particular variables justifies one model
rather than another. Thus it might be argued, with respect to change in
interracial attitudes after desegregation, that the school is a more critical
unit than the class or the district. It might equally
be argued that in improving the self-concept of individual children
the class is a more potent source of variance than the school. Some
other variable (truancy?) is perhaps more associated with the
individual and his home, and less with the unit of instruction. Then
it would make sense to frame the model with the individual as primary.
Such a model might start with an analysis of the pooled data from all
classes, and then look at groups with unexpectedly high and low rates of truancy.
Nonlinearity. Nonlinear terms could be added to the model.
After extracting the common within-groups regression, one can reasonably fit a
coefficient to terms of the form $\beta\,(X_{pc} - \mu_{X_c})\,\mu_{X_c}$; the
contribution of the specific-within-group regression is reduced accordingly.
This added
member allows for the possibility that within-group slopes are 
linearly related to the class mean on X (as in panel iv of Figure 1.3). 
Whether it will be profitable to make this separation is to be judged in 
the light of one's prior beliefs about the phenomenon under investigation. 
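
The possibility that within-group slopes vary linearly with the class mean can be checked informally in two steps: estimate a slope inside each class, then regress those slopes on the class means. The sketch below is only an illustration of that idea, not the product-term fit itself; the helper names (`ols`, `slope_on_class_mean`) and the noise-free toy data are assumptions introduced here.

```python
from statistics import fmean

def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def slope_on_class_mean(classes):
    """Regress each class's own Y-on-X slope on that class's mean of X."""
    means, slopes = [], []
    for xs, ys in classes:
        means.append(fmean(xs))
        slopes.append(ols(xs, ys)[1])
    return ols(means, slopes)

# Hypothetical data: the slope inside a class with mean m is 0.5 + 0.1 * m.
classes = []
for m in (2.0, 4.0, 6.0):
    xs = [m - 1.0, m, m + 1.0]
    ys = [1.0 + (0.5 + 0.1 * m) * (x - m) for x in xs]
    classes.append((xs, ys))

intercept, gradient = slope_on_class_mean(classes)
print(intercept, gradient)
```

A nonzero `gradient` in real data would be a reason to consider the product term; with few classes it is, of course, subject to the chance relationships warned of later in this section.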

It leaps ahead of the story to consider predictors other than X.
Any effect of global properties of groups on within-groups
relationships must come via a product term. A group property
$G$ is necessarily uncorrelated with $Y_{pc} - \mu_{Y_c}$, but
$G\,(X_{pc} - \mu_{X_c})$ may correlate with $Y_{pc} - \mu_{Y_c}$.

Nonlinearity could also be introduced via quadratic terms [in $\mu_{X_c}^2$,
$(X_{pc} - \mu_{X_c})^2$, etc.]. In fact, one of the most striking
findings of distinctive within-class regressions is that of
Majasan (see Cronbach, 1975). Majasan predicted that measured achievement
in a college psychology class would have a parabolic regression on the
student's BQ score. The BQ score reflected
behavioristic (vs. humanistic) statements. Majasan had a BQ score for each
instructor and he predicted that (with aptitude held constant) the parabolic
regression would have its peak where the student BQ matched the instructor's.
He was able to confirm this prediction in 10 out of 11 classes, the exception
being a class where no measured-achievement criterion was available. (Majasan
could not investigate the between-class regression because the course
examination varied with the class.)

There is a lively danger that regression techniques will dramatize 
relationships that arose by chance; and making hypotheses complex adds to 
the risk. Nonlinearities may reasonably be explored, but unless there is a
rationale for predicting nonlinearity, little credence can be given a
nonlinear relationship the first time it turns up.

Effects of aggregating data

It will be necessary next to examine the relation among $\beta_b$, $\beta_w$, and
$\beta_t$. The three are linked by the intraclass correlation $\eta_X^2$
(p. 3.2). My ideas on this subject have been formed over years of discussion
with Leigh Burstein, whose dissertation (1975) on the bias problem has in
turn been influenced by his work with Hannan (Hannan & Burstein, 1974). My
formulation is structured differently from Burstein's in important particulars,
but the formulas to be presented for $\beta_b$ and $\beta_w$ in Figures 3.5 and
3.7 are consistent with his.
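
The link among the three coefficients can be made concrete with the usual decomposition of sums of squares and cross-products. The sketch below assumes that $\eta_X^2$ is computed as the between-groups share of the sum of squares of X (the correlation ratio, which the notes to this section treat as identical when class membership is fixed). Under that assumption $\beta_t = \eta_X^2 \beta_b + (1 - \eta_X^2)\beta_w$ holds as an algebraic identity for any data. The function name `decompose` and the toy scores are illustrative only.

```python
from statistics import fmean

def decompose(groups):
    """Return (beta_t, beta_b, beta_w, eta2) for groups of (x, y) pairs."""
    pooled = [p for g in groups for p in g]
    mx = fmean(x for x, _ in pooled)
    my = fmean(y for _, y in pooled)
    # Total sums of squares and cross-products about the grand means.
    sxx_t = sum((x - mx) ** 2 for x, _ in pooled)
    sxy_t = sum((x - mx) * (y - my) for x, y in pooled)
    # Between-groups part: group means weighted by group size.
    sxx_b = sxy_b = 0.0
    for g in groups:
        gx = fmean(x for x, _ in g)
        gy = fmean(y for _, y in g)
        sxx_b += len(g) * (gx - mx) ** 2
        sxy_b += len(g) * (gx - mx) * (gy - my)
    # The within-groups part is what remains after subtraction.
    beta_w = (sxy_t - sxy_b) / (sxx_t - sxx_b)
    return sxy_t / sxx_t, sxy_b / sxx_b, beta_w, sxx_b / sxx_t

# Hypothetical scores for three small classes.
groups = [
    [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)],
    [(4.0, 4.0), (5.0, 7.0), (6.0, 6.0)],
    [(7.0, 9.0), (8.0, 8.0), (9.0, 12.0)],
]
beta_t, beta_b, beta_w, eta2 = decompose(groups)
print(beta_t, eta2 * beta_b + (1.0 - eta2) * beta_w)
```

The two printed values agree up to rounding error whatever scores are supplied, which is the sense in which the development that follows is tautological rather than substantive.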

The traditional problem of "assessing the bias" due to analyzing at the
group level when $\beta_t$ is wanted deserves little of our attention; we rarely
want $\beta_t$. We do want to compare $\beta_b$ and $\beta_w$.
The development that follows (and the highly general development that ends
the section) lays out some algebraic tautologies. It does not depend in any
way on substantive considerations, and would hold true for data generated by
any causal model whatsoever. The analysis nonetheless fulfills an important
function, in showing how numbers that are sometimes given a substantive
interpretation can be generated by the aggregation rule.

The argument is most directly understandable when two individual
characteristics that exist simultaneously are to be related to each other, for
example the potato and wheat yields of Yule and Kendall, or the ethnic and
religious identifications discussed by Duncan et al. This is post hoc grouping;
the effects have already been developed, perhaps in group settings that have
no relation to the groups now being composed for purposes of analysis. Those
data might be grouped in a number of arbitrary ways; the joint distribution
overall has been established. For the Y-on-X regression, how does the
within-groups or the between-groups regression depart from the overall $\beta_t$?
The counterpart question can be asked about the X-on-Y regression. The
answers will depend on how the bases on which groups are differentiated relate
to X and Y, or, what is equivalent, to X and the partial variate $Y \cdot X$.



Special case with linear assumption. To develop a comparatively simple
argument, I introduce a discriminant function W, and assume that X, Y,
and W have a multivariate normal distribution. We may consider that groups
are formed by dividing W into regions and assigning persons in the same
region to the same group for purposes of analysis (not necessarily for
treatment). It follows that the means of X, Y, and W are perfectly correlated.
(It is this condition on which the immediately following argument depends.
It could be satisfied when the assumption of trivariate normality is not.
But such a strong assumption will serve for the moment.) In the rest of the
argument I shall use Z and not W; Z is simply the group mean on W, and every
member of the group has the same Z score.

Instead of working with correlated variables I substitute orthogonal
variables I, II, and III; these are perfectly correlated, respectively, with
X, $Y \cdot X$, and $Z \cdot YX$. Each of these components has a zero mean and
a unit s.d. over individuals in the population. The relations to be developed
in this population would be found for samples, if sample statistics were used
to define I, II, and III. Variables I, II, and III can be thought of as
coordinates in a three-space; I and II are orthogonal coordinates for the
X, Y plane.

Figure 3.5 shows how the standard scores on X, Y, and Z may be described
in terms of component loadings. (To use standard scores simplifies the
argument without loss of generality.) The symbol A is used for the correlation
of X and Y, overall, and CB for the correlation of X and Z. This notation is
used because B proves to be a key parameter; all Z that project into the
same line of the X, Y plane have the same B. Dimensions I and II carry
the information in X and Y, and $C^2$ is the proportion of variance in Z
that I and II account for: $C = R_{Z \cdot XY}$. As a convention, C takes a
positive sign.

If $C^2 = 1$, Z lies in the X, Y plane and therefore must be a continuous
variable. This can occur only if the grouping procedure sliced the W scale
into infinitely thin slices; hence it is a hypothetical limiting case.

As Figure 3.5 shows, the covariances within any group (i.e., among
persons with Z constant) can be described by a partial-covariance formula.
Then simple subtraction produces the between-group covariances in the
population. No matter what the value of C, the between-groups regression
coefficient is the same. The relation of $\beta_b$ to $\beta_t$, then, depends
on B and not on A or C.

This formulation applies to the population. As with C = 1,
C = 0 is a hypothetical limiting case. A finite collection of individuals
can be regarded as a population, hence the formulas can apply to them.
If the grouping procedure is strictly random, however, the correlation of
X with W and that of Y with W will not be precisely zero. Consequently, C
will not reach 0. With purely random post hoc assignment, any one group is
a random sample of the total collection of individuals and its within-group
regression coefficient is an unbiased estimate of $\beta_t$. Consequently,
with purely random grouping subsequent to treatment of individuals, the
values of $\beta_b$ and $\beta_w$ both approach $\beta_t$ as the number of
groups becomes larger.

(The formulas for $\beta_b$ and $\beta_w$ remain the same when the linear
restriction is removed; see p. 3.20.)

Component loadings

            I        II                  III
    X       1
    Y       $A$      $\sqrt{1-A^2}$
    Z       $CB$     $C\sqrt{1-B^2}$     $\sqrt{1-C^2}$

Overall covariances

            X        Y                                      Z
    X       1
    Y       $A$      1
    Z       $CB$     $CAB + C\sqrt{1-A^2}\sqrt{1-B^2}$      1

    $\beta_{YX}$ overall $= A$

Within-groups covariances (Z partialled)

    $\sigma(X,X) = 1 - C^2B^2$
    $\sigma(X,Y) = A - C^2AB^2 - BC^2\sqrt{1-A^2}\sqrt{1-B^2}$
    $\sigma(X,Z) = \sigma(Y,Z) = 0$

    $\beta_{YX}$ within groups $= \beta_{YX \cdot Z}
        = A - \dfrac{BC^2\sqrt{1-A^2}\sqrt{1-B^2}}{1 - C^2B^2}$
      or $A - \dfrac{\eta_X^2}{1-\eta_X^2}\sqrt{1-A^2}\,\tan\phi$

Between-groups covariances

    $\sigma(X,X) = C^2B^2$
    $\sigma(X,Y) = C^2B\,[AB + \sqrt{1-A^2}\sqrt{1-B^2}]$
    $\sigma(Y,Y) = C^2\,[AB + \sqrt{1-A^2}\sqrt{1-B^2}]^2$
    $\sigma(X,Z) = CB$
    $\sigma(Y,Z) = CAB + C\sqrt{1-A^2}\sqrt{1-B^2}$

    $\beta_{YX}$ between groups $= A + \dfrac{\sqrt{1-A^2}\sqrt{1-B^2}}{B}$
      or $A + \sqrt{1-A^2}\,\tan\phi$

    $\tan\phi = \dfrac{\sqrt{1-B^2}}{B}
        = \dfrac{\rho_{ZY} - \rho_{ZX}\rho_{XY}}{\rho_{ZX}\sqrt{1-\rho_{XY}^2}}$

    $\beta_{YX}$ between groups $= \eta_Y / \eta_X$. See Note 2.

First part of Figure 3.5

Values of regression slopes when C = 1

    B                        $-1$     $-\sqrt{1-A^2}$   $-A$          0         $A$      1
    Z coincides with         $-X$     $X \cdot Y$                     $Y \cdot X$   $Y$      $X$
    $\eta_X^2$               1        $1-A^2$           $A^2$         0         $A^2$    1
    Overall slope            $A$      $A$               $A$           $A$       $A$      $A$
    Between-groups slope(a)  $A$      0                 $2A - 1/A$    indet.    $1/A$    $A$
    Within-groups slope(a)   indet.   $1/A$             $2A$          $A$       0        indet.

    (a) See the note for p. 3.14.

Figure 3.5. Regression coefficients as a function of parameters
of a single grouping variable under a linear assumption

How does the discriminant function affect $\beta_b$ and $\beta_w$?
Figure 3.5 gives formulas in A, B, and C, and, more simply, in terms of $\phi$.
Let X* be the projection into the X, Y plane of the line along which
the group means of X and Y lie, and define $\phi$ as the angle between X* and X;
$\cos\phi = B$.

Figure 3.6 is for the case $\beta_t = 0.40$; a horizontal broken line
represents $\beta_t$. One curve displays the value of $\beta_b$ at each $\phi$
from -90° to 90°; this curve would be repeated in the range
90° < $\phi$ < 270°. A second function represents values of $\beta_w$ with
C = 1; this is the unrealistic case where W is divided into infinitesimal
regions. The third function gives $\beta_w$ when C = 0.8. As C declines
toward zero the line for $\beta_w$ comes closer to that for $\beta_t$.
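
The behavior described for Figure 3.6 can be tabulated directly from the Figure 3.5 formulas. The sketch below is an illustration under the standard-score, linear assumptions of this section, with $\eta_X^2 = C^2B^2$; the function name `figure_3_5_slopes` and the particular values of B and C are choices made here for illustration, not values taken from the figure.

```python
import math

def figure_3_5_slopes(a, b, c):
    """Between- and within-groups slopes from A, B, C (standard scores).

    Assumes 0 < |b| <= 1 and (c * b) ** 2 < 1; square roots are taken
    as positive, so the sign of b carries the sign of tan(phi).
    """
    k = math.sqrt(1.0 - a * a) * math.sqrt(1.0 - b * b)
    eta2 = (c * b) ** 2                     # between-groups share of X variance
    beta_b = a + k / b                      # A + sqrt(1 - A^2) * tan(phi)
    beta_w = a - b * c * c * k / (1.0 - eta2)
    return beta_b, beta_w, eta2

# Overall slope 0.40, grouping variable with B = 0.40, three values of C.
for c in (1.0, 0.8, 0.4):
    bb, bw, eta2 = figure_3_5_slopes(0.40, 0.40, c)
    print(f"C={c:.1f}  beta_b={bb:.3f}  beta_w={bw:.3f}  eta2={eta2:.3f}")
```

The between-groups slope does not change with C, while the within-groups slope moves toward the overall slope as C falls, which is the pattern the text attributes to the curves.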

Obviously, the relation of $\beta_b$ and $\beta_w$ depends on the relation of the
grouping variable Z to X and Y. This is similar to the effect of "restriction
of range" or "truncation" on test validity, a problem well known in
psychometrics. I interpret these results before offering a more general
development.

Meaning of $\phi$. Traditional writings on aggregation bias have thought of the
grouping principle as one that could be arbitrarily established. Thus Feige
and Watts (1972) were looking for a way to group banks into small sets so that
the Federal Reserve System could report data for the sets without violating
confidentiality, and yet the data would represent the microeconomic relations
adequately. The formulas of this section apply to such a case, but we are
also interested in applying them to groups that were formed by natural processes.
$\beta_t$, which defines $Y \cdot X$, is determined from the overall distribution.
In an experimental treatment, the values of Y reflect any predictable and
unpredictable effects including context effects. The grouping variable may
be related to $Y \cdot X$ for many reasons. The most startling paradox
is that if groups are formed strictly on the basis of X there may nonetheless
be a relation of Z to $Y \cdot X$, hence a nonzero $\phi$. Suppose that grouping
is based on X, and no other initial characteristic adds to the prediction
of Y. But suppose there is a context effect, such that high-X groups have
high Y means. If one partials X out of Y, using the overall
regression coefficient $\beta_t$, the high-X groups will have a positive residual
$Y \cdot X$ and the low-X groups will have a negative residual. Then $\phi$ will be
positive. (An inverse context effect, with high-X groups doing comparatively
badly, will make $\phi$ negative.) A positive $\phi$ could also arise in the
absence of any context effect if grouping takes into account some W that
is correlated with individual values of $Y \cdot X$. An unlikely third possibility
is that grouping is actually based in part on the outcomes of the treatment.
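
The paradox of grouping strictly on X can be seen in a noise-free toy case. In the sketch below, which is only an illustration, twelve persons are sliced into three groups on X alone, and each person's Y is an individual effect of 0.5 per unit of X plus a hypothetical context effect of 0.3 per unit of the group mean of X. The helper `decompose` and all numerical values are assumptions made for the example.

```python
from statistics import fmean

def decompose(groups):
    """Return (beta_t, beta_b, beta_w) for groups of (x, y) pairs."""
    pooled = [p for g in groups for p in g]
    mx = fmean(x for x, _ in pooled)
    my = fmean(y for _, y in pooled)
    sxx_t = sum((x - mx) ** 2 for x, _ in pooled)
    sxy_t = sum((x - mx) * (y - my) for x, y in pooled)
    sxx_b = sxy_b = 0.0
    for g in groups:
        gx = fmean(x for x, _ in g)
        gy = fmean(y for _, y in g)
        sxx_b += len(g) * (gx - mx) ** 2
        sxy_b += len(g) * (gx - mx) * (gy - my)
    return sxy_t / sxx_t, sxy_b / sxx_b, (sxy_t - sxy_b) / (sxx_t - sxx_b)

# Groups are formed strictly on X: persons 1-4, 5-8, 9-12.
xs = [float(v) for v in range(1, 13)]
groups = []
for start in range(0, 12, 4):
    block = xs[start:start + 4]
    gmean = fmean(block)
    # Individual effect 0.5 * x plus a context effect 0.3 * (group mean of X).
    groups.append([(x, 0.5 * x + 0.3 * gmean) for x in block])

beta_t, beta_b, beta_w = decompose(groups)
# Group means of the residual Y - beta_t * X, from the low-X to the high-X group.
residual_means = [fmean(y - beta_t * x for x, y in g) for g in groups]
print(beta_t, beta_b, beta_w, residual_means)
```

Although no variable other than X entered the grouping, the residual means rise from the low-X to the high-X group, so Z is related to $Y \cdot X$ and $\beta_b$ exceeds $\beta_t$.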

Comparing $\beta_b$ and $\beta_w$. In the sociological literature there is a line of
reasoning that runs like this: If $\beta_b$ equals $\beta_t$, there are no context
effects. Or, equivalently, if $\beta_b = \beta_w$, there are no context effects.
Thus one can regress college plans, as in Figure 3.2, on SES within schools.
If $\beta_b = \beta_w$, Duncan et al. (1972, p. 197) would dismiss the hypothesis
that the school climate had an effect (direct or indirect) on aspirations.
In such a case they suggest that the between-groups slope merely aggregates
evidence of effects at the individual level.

Duncan et al. (p. 195) expand on their version of Figure 3.2 (in a
manner that loses something in my compression and in translation to the
regression form). Essentially, they substitute for the $\mu_X$-to-$\mu_Y$ path a
diamond-shaped configuration that makes room for intermediate variables
$\hat{Y}_c$ and $Y^*_c$, whose sum is $\mu_{Y_c}$. $\hat{Y}_c = b_w \mu_{X_c}$ and
hence is a predicted school mean under the hypothesis $\beta_b = \beta_w$. In the
population, $\beta_w \mu_{X_c}$ would surely be the between-groups predictor if
classes had been formed at random and if no causal effects at the school
level entered into $\mu_{X_c}$ or $Y_{pc}$. The part of $\mu_{Y_c}$ that is left
over, $Y^*_c$, is analogous to the adjusted mean in analysis of covariance.

For the Campbell-Alexander data the coefficient linking $\mu_{X_c}$ to
$\hat{Y}_c$ is set equal to $b_w$, i.e., 0.47; and the coefficient leading to
$Y^*_c$ is 0.63 - 0.47 = 0.16. The standard deviations of $\hat{Y}_c$ and $Y^*_c$
are 0.21 and 0.07 respectively. All this suggested to Duncan et al. that
"composition effects" (demographic) are much stronger than "school effects"
(group-caused). This kind of inference does not appear to be justified (see
the note for p. 3.17).

In the Campbell-Alexander data $\eta_X^2 = 0.20$. Suppose that 0.50 is the
regression coefficient of college plans on SES that would occur if each
student were miraculously to grow through adolescence completely insensitive
to any effect of the group around him, so far as his college plans are
concerned. Under this no-school-effects hypothesis, if students were
grouped nonrandomly with $\eta_X^2 = 0.20$, the values of $\beta_b$ and $\beta_w$
could be those of the Campbell-Alexander data: 0.63 and 0.47. What is
required under the linear assumption of the model above is that
$\tan\phi = 0.15$, which implies that grouping depends much more on SES than on
other correlates of educational aspirations. When grouping has no causal
effect, a difference between $\beta_b$ and $\beta_w$ will occur unless $\phi = 0$.
That is, a nonzero $\beta_b - \beta_w$ certainly implies a causal effect at the
group level only if the discriminant function combines the predictor X (in
this case SES) with information unrelated to the dependent variable. This
allows random grouping as a limiting case. So much, then, for the attempt to
infer a causal grouping effect from a difference between the two coefficients.
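
The arithmetic behind this example can be checked against the Figure 3.5 formulas stated in terms of $\phi$ and $\eta_X^2$. The sketch below assumes standard scores, so that the hypothesized coefficient 0.50 plays the role of A; the function name `slopes_from_angle` is introduced here for illustration.

```python
import math

def slopes_from_angle(a, tan_phi, eta2):
    """Figure 3.5 formulas: slopes from A, tan(phi), and eta_X squared."""
    spread = math.sqrt(1.0 - a * a) * tan_phi
    beta_b = a + spread                          # between groups
    beta_w = a - eta2 / (1.0 - eta2) * spread    # within groups
    return beta_b, beta_w

# No-school-effects coefficient 0.50, tan(phi) = 0.15, eta_X^2 = 0.20.
beta_b, beta_w = slopes_from_angle(0.50, 0.15, 0.20)
print(round(beta_b, 2), round(beta_w, 2))
```

Rounded to two places the results reproduce the 0.63 and 0.47 quoted from the Campbell-Alexander data, so a purely demographic grouping rule of this kind is enough to generate the reported difference.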

It is a little less obvious that a zero difference in coefficients can
arise in the presence of a group-caused effect. For the sake of argument I
specify the grouping variable: students are assigned to schools strictly on
the basis of SES. If we continue to suppose that group context has no effect,
$\beta_b = \beta_w = \beta_t$, since $\phi = 0$. Now suppose that a causal effect of
the Coleman type is added. In high-SES schools aspirations of the student
body as a whole are given a boost; in low-SES schools aspirations are
lowered. This context effect raises $\beta_b$ but does not imply a change in
$\beta_w$. In consequence, $\beta_b$ exceeds the original $\beta_t$. This seems to fit
with the common reasoning of sociologists. But Meyer among others has
suggested another type of context effect. A student's aspiration tends to be
raised when he finds himself superior to his schoolmates, and lowered when
he is below the mean. This has the effect of raising $\beta_w$ above the
original $\beta_t$. Hence effects of the Coleman and Meyer types may offset
each other and leave $\beta_b = \beta_w$. Also, a group-caused effect may offset a
demographic effect. This reminder of the argument developed at p. 2.6ff.
indicates that the absence of a causal grouping effect does not follow from
$\beta_b = \beta_w$.

It should be noted that nothing in the model itself implies the presence
or absence of a causal effect from grouping. Consider the possibility that
groups are formed prior to the events that determine Y; so $\eta_X^2$ is
predetermined. Let one collection of groups be treated as groups. Let members
of equivalent groups be treated as individuals. If grouping has a causal
effect, $\beta_t$ and $\phi$ will almost certainly not be the same in the two data
sets, hence the $\beta_b$ and $\beta_w$ values will not be the same. But the
formulas of Figure 3.5 fit both studies. This in itself implies that there
is no way to reason about the effects of grouping on the basis solely of
Y-on-X or similar data.

This discussion takes on added importance in the light of Alwin's paper.
Alwin shows that the analyses preferred by the Columbia group of sociologists
and the analyses preferred by Duncan, Hauser, and others of their persuasion
lead to identical conclusions, in the multivariate as well as the univariate
case. The fact that $\beta_b - \beta_w$ cannot be directly interpreted implies the
need for more elaborate models and for the collection of more evidence on the
presumed mediator of the context effect. In the Meyer case, for example,
evidence on self-perceived ability would not be hard to collect.

Implication for ATI research. I started this investigation with
a concern for contrasting regression slopes across treatments. If, I
said, the within-group-within-treatment slopes were the same across
treatments, and the between-group slopes were different across treatments,
this suggested an ATI at the group level (i.e., a causally interpretable
group effect). Such a commonsense view must be modified to recognize
what has just been said about aggregation effects.

To glimpse some of the problems, let us assume that the relation
of Y to X is strictly individual within each treatment (i.e., that
grouping has no behavioral consequences). Assume also that the two
Y-on-X regressions have an identical positive slope for all individuals
pooled. But suppose that the grouping principle used in one treatment differs
from that used to form groups in the other. Then, of course, any
analysis of between- and within-groups information refers to
different populations of groups, even if the individuals came from the
same population of individuals. And any difference found in comparing
statistics may be attributable to the grouping rule.

Suppose that in one treatment groups are formed by random
sampling. Then, within the limits of sampling error, $\beta_b = \beta_t = \beta_w$.
Suppose that, in the second treatment, group formation is
influenced by some variable (other than X) that predicts Y well.
Then $\beta_b \neq \beta_w$, and strong ATI effects will be reported at
the between-groups and within-groups levels! It would be possible
to contrive an example in which there was no interaction between or
within groups, but an overall interaction did appear.

If an ATI study sets out to compare two treatments that are
already in place, how groups were formed is crucial to the interpretation.
Groups served by one program may differ in their demographic makeup from
those served by the alternative program, even though both sets of groups
cover about the same range of individual differences.

Perhaps this section can be summed up simply by saying
that interpretation of regression coefficients must be exceedingly
circumspect when the grouping rule is confounded with treatment, so that
each treatment is observed in a different population of groups.
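
A small constructed case shows how the grouping rule alone can produce different between- and within-groups slopes. In the sketch below, which is only an illustration, the same eight hypothetical persons (with Y = X + 2U, where U is a second predictor uncorrelated with X) are grouped once on X alone and once by a rule that also draws on U. The helper `decompose`, the variable U, and the group assignments are all assumptions made for the example.

```python
from statistics import fmean

def decompose(groups):
    """Return (beta_t, beta_b, beta_w) for groups of (x, y) pairs."""
    pooled = [p for g in groups for p in g]
    mx = fmean(x for x, _ in pooled)
    my = fmean(y for _, y in pooled)
    sxx_t = sum((x - mx) ** 2 for x, _ in pooled)
    sxy_t = sum((x - mx) * (y - my) for x, y in pooled)
    sxx_b = sxy_b = 0.0
    for g in groups:
        gx = fmean(x for x, _ in g)
        gy = fmean(y for _, y in g)
        sxx_b += len(g) * (gx - mx) ** 2
        sxy_b += len(g) * (gx - mx) * (gy - my)
    return sxy_t / sxx_t, sxy_b / sxx_b, (sxy_t - sxy_b) / (sxx_t - sxx_b)

# Eight persons as (x, y) pairs with y = x + 2u; u = 0 for the first four.
people = [(float(x), float(x + 2 * u)) for u in (0, 1) for x in (1, 2, 3, 4)]

# Rule A: grouping on X alone.
rule_a = [[p for p in people if p[0] <= 2], [p for p in people if p[0] >= 3]]

# Rule B: grouping influenced by U as well as X (hypothetical assignment).
rule_b = [
    [people[0], people[1], people[2], people[4]],
    [people[3], people[5], people[6], people[7]],
]

t_a, b_a, w_a = decompose(rule_a)
t_b, b_b, w_b = decompose(rule_b)
print((t_a, b_a, w_a), (t_b, b_b, w_b))
```

The pooled slope is the same under both rules because the persons are the same, yet the between-groups slope rises and the within-groups slope falls under the second rule. Had each rule been applied in a different treatment, the contrast would read as an interaction at both levels.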

Aggregation effects with multiple discriminants. A general formulation
can be offered to replace the simple one used to this point.
Consider two discriminant functions $W$ and $W'$. Persons
are grouped by imposing assignment rules on the joint distribution of
$W$ and $W'$. The corresponding group means will be denoted by $Z$ and $Z'$.
All persons within a segment are assigned the same $Z$ and $Z'$. It is
easiest to conceive of slicing first on $W$ and then on $W'$.
(If $W$ and $W'$ fell in the X, Y plane,
this would divide the original distribution into lozenges.) It is not
necessary to make the division by sharp cuts, however. As before, the
model applies to data where the investigator did not control assignment
to groups; the role of $W$ and $W'$ is to provide a sufficient simulation
of the grouping process that might have occurred.

I require that the group means $\mu_X$ and $\mu_Y$ be
linear functions of $Z$ and $Z'$, i.e., that
$\rho^2_{\mu_X \cdot ZZ'} = \rho^2_{\mu_Y \cdot ZZ'} = 1$. This is
only slightly restrictive. Given any distribution of
$\mu_X, \mu_Y$ one can easily form a $Z$ and $Z'$ that will reproduce the first
and second moments (but not necessarily higher moments) of the distribution.
(E.g., let $Z = \mu_X$ and $Z' = \mu_Y - \beta_{YX}\,\mu_X$. Many alternative pairs
of continuous $W$ and $W'$ will generate this $Z$ and $Z'$ respectively.)

This model is sufficient to reproduce first and second moments when
grouping was regulated by complex contingency rules, since any such rule
still leaves us with a $\mu_X, \mu_Y$ pair for each group, hence a $Z, Z'$ pair.
The model would need to be extended to deal with multiple-regression problems.
The development could be stated for any $Z$ and $Z'$, but no information is
lost if $Z$ and $Z'$ are rotated within the plane. I therefore work with
variables $Z_1$ and $Z_2$ which are orthogonal; I require that $Z_2$ have no
correlation with $Y \cdot X$.

Instead of using partial covariances as in Figure 3.5, Figure 3.7
proceeds more directly to the results. Figure 3.7 starts with a factorial
model in four dimensions, with each variable assigned unit s.d. The
multiple-regression equations for predicting X and Y from $Z_1$ and $Z_2$ are
formed. Since all members of a group have the same values of $Z_1$ and $Z_2$,
these regression equations predict the group means of X and Y.

Despite the more complex model, the formulas match the results in
Figure 3.5, when those are stated in terms of $A$ ($= \rho_{XY}$), $\tan\phi$,
and $\eta_X^2$. The only functional difference is that
$\eta_X^2 = C^2B^2 + D^2$, not $C^2B^2$ as before. Figure 3.6 applies to
$\beta_b$ but not to $\beta_w$, since holding C constant does not simplify
adequately. A figure can be developed holding $\eta_X^2$ and C constant,
allowing $\phi$ to vary. (B implies $\phi$.)

Component loadings

            I        II                  III               IV
    X       1
    Y       $A$      $\sqrt{1-A^2}$
    $Z_1$   $CB$     $C\sqrt{1-B^2}$     $\sqrt{1-C^2}$
    $Z_2$   $D$                          $E$               $\sqrt{1-D^2-E^2}$

    $CBD + E\sqrt{1-C^2} = 0$

    $\mu_X = \hat{X} = CB\,Z_1 + D\,Z_2$

    $\mu_Y = \hat{Y} = (ACB + C\sqrt{1-A^2}\sqrt{1-B^2})\,Z_1 + AD\,Z_2$

    $\sigma(\mu_X, \mu_Y) = AC^2B^2 + C^2B\sqrt{1-A^2}\sqrt{1-B^2} + AD^2$

    $\beta_b = \dfrac{\sigma(\mu_X, \mu_Y)}{\sigma^2(\mu_X)}
        = A + \dfrac{C^2B\sqrt{1-A^2}\sqrt{1-B^2}}{C^2B^2 + D^2}$

    $\mu_X = CB\,(CB\,\mathrm{I} + C\sqrt{1-B^2}\,\mathrm{II} + \sqrt{1-C^2}\,\mathrm{III})
        + D\,(D\,\mathrm{I} + E\,\mathrm{III} + \sqrt{1-D^2-E^2}\,\mathrm{IV})$

    $\mu_X = (C^2B^2 + D^2)\,\mathrm{I} + C^2B\sqrt{1-B^2}\,\mathrm{II} + \ldots$

If we write X* for the projection of $\mu_X$ into the I, II plane, and
define $\phi$ as the angle between X* and X,
$\tan\phi = C^2B\sqrt{1-B^2}\,/\,(C^2B^2 + D^2)$. Hence

    $\beta_b = A + \sqrt{1-A^2}\,\tan\phi$

    $\beta_w = \dfrac{A - \sigma(\mu_X, \mu_Y)}{1 - \sigma^2(\mu_X)}
        = \dfrac{A - AC^2B^2 - C^2B\sqrt{1-A^2}\sqrt{1-B^2} - AD^2}{1 - (C^2B^2 + D^2)}$

    $\beta_w = A - \dfrac{\sqrt{1-A^2}\,\tan\phi\,(C^2B^2 + D^2)}{1 - (C^2B^2 + D^2)}
        = A - \sqrt{1-A^2}\,\dfrac{\eta_X^2}{1-\eta_X^2}\,\tan\phi$

Figure 3.7. Regression coefficients as a function of parameters derived
from grouping variables.

As before, $\beta_b - \beta_w = [\sqrt{1-A^2}\,\tan\phi]\,\dfrac{1}{1-\eta_X^2}$.
If $\eta_X^2 = 0$, $\beta_b$ is indeterminate. If $\eta_X^2 = 1$, $\beta_w$ is
indeterminate. Disregarding those cases, $\beta_b - \beta_w$ has the same sign
as $\tan\phi$.

D W 

Assume that $\rho_{XY}$ and $\rho_{XZ_1}$ are positive; this only polarizes those
variables. Then $\tan\phi$ can be negative only if $\sqrt{1-B^2}$ is negative, that
is to say, variable $Z_1$ is negatively correlated with $Y \cdot X$. This can
arise from a causal effect that places high-X groups at a disadvantage
(including a Meyer-type context effect). If
there is no group-caused effect, the negative value can arise from an assembly
rule such that groups containing more high-X persons tend to contain persons
who are low on some other predictor of Y. For example, if pupils were
assigned to classes on the basis of IQ (which can be interpreted as a function
of MA - Age), the highest group will be high on MA and low in Age, on the
average. Then if MA is used to estimate or forecast achievement, these
conditions would make $\beta_b < \beta_w$. Grouping on degree of "underachievement"
may produce a similar anomaly. It appears unlikely that demographic
effects alone will often make $\beta_b < \beta_w$.

b w 

The relation $\beta_b = \beta_w$ occurs with grouping on some combination of
X with an irrelevant variable (perhaps a random assignment process).
This requires that there be no demographic variable or other precursor X'
such that $\rho_{X'(Y \cdot X)} \neq 0$. That is, X completely specifies the
relevant grouping variables.

Notes for Section 3

p. 3a   Walberg (personal communication) suggests that the mean should not be used
as an aggregate statistic because of its sensitivity to skew and especially
to outliers. Decker Walker suggests that the model should provide for
nonlinear regression from the outset (and this does appear to be important
with a categorical variable such as that of Bowers). I prefer to leave these
possible elaborations in the background. Investigators should inspect plots
of between-groups and pooled-within-groups relations.

At different places in this report I have shifted notation, more because
the sections were drafted at different times than for any good reason. What
appears here as $\mu_{X_c}$ was simply $X$ in Section 1 and $X_c$ in Section 2.

p. 3.2   Others have used $E^2$, the correlation ratio, in place of $\eta_X^2$. When
class membership is fixed, as I assume, the two are identical.

p. 3.14   When Z coincides with X and each differential element of Z defines
a new group (B = ±1, C = 1, $\phi$ = 0), the variance of X within groups is zero
and the slope is undefined. With B = ±1 and C even slightly different from
1, the within-groups slope becomes $\beta_t$.

When Z coincides with $Y \cdot X$ or projects into $Y \cdot X$ ($\phi$ = 90°), the
between-groups variance is zero and the between-groups slope is undefined.
As $\phi$ increases toward 90° the slope becomes indefinitely great; as $\phi$
decreases from 0° toward -90°, the slope becomes indefinitely great but
negative.

Statements about $\eta_X$ and $\eta_Y$ in Figure 3.5 are true when both
are given positive signs.

p. 3.17   Personal communication indicates that Duncan does not wish to defend the
argument discussed here; it was formulated nearly ten years ago. Today he
would emphasize that coefficients are highly equivocal unless there is
commitment to a causal model of the process of group formation and of the
generation of the dependent variable. The Alwin paper indicates that the
reasoning of the 1972 publication has not been superseded in the sociological
literature.

p. 3.24   With a finite number of groups a strictly random process may generate a
value of $\phi$ that is far from zero in either direction.

4. The reference population and its parameters

Alternative models for statistical inference 

Data on students observed in a group of classes could be interpreted 
with no attempt to generalize. That is, the classes, and the students 
within the classes, could be regarded as fixed. (One could consider the 
data themselves as a sample of observations that might have been made on 
these same subjects, in which case one would generalize over the universe 
of observations.) 

The most obvious way to frame a generalization is to assert that persons
and classes are randomly sampled from a population of students. This
requires drawing students randomly and independently to fill each class in
turn, which would make approximately zero the intraclass correlation for
every initial characteristic of the persons. This is not reasonable for
most groups that exist in society, and it is likely to be contradicted by
the data in hand.

In trying to identify more plausible alternatives, I confine attention 
to two levels, collectives and members. The ideas apply to 
subcollectives as much as to individual members, and are readily extended 
to additional levels. I assume that it is intended to generalize over 
collectives, and that the collectives are a random sample of the population 
of collectives. Collectives, then, may be considered "random" as that 
term is used in the statistical literature. 

In deciding whether to treat members as fixed or in some sense random,
the key lies in the structure of the population. Are all the collectives
separate and distinct? Or do different sets of members constitute
realizations of "the same collective"? The second alternative applies most
obviously when the population of collectives extends over localities and
over time. High-school student bodies have different members each year.
An investigator interested in persistent differences among schools might
think of the population as comprising a number of "local" populations,
each made up of a succession of student bodies in different years. (The
population may be finite.) Classes within a population of classes might
likewise be identified with teachers, so that the potential members of
classes of one teacher constitute a local population. I next discuss the
model for inference that follows from each of the alternatives. At the
end, it will be possible to discuss the bases for choice between the two
conceptions.

Collectives distinct, persons fixed . In the first formulation, 
collectives are regarded as without connecting identities, just as persons 
are in the usual models for inference. Collectives are sampled independ- 
ently, from a population of collectives that might have been formed by 
applying a particular grouping rule to a population of individuals. The 
grouping rule may be under deliberate control or may be a social process 
that can be only inferred from the data. 

It seems to me that under these circumstances members have to be
regarded as fixed. A certain group of persons was assembled and together
went through certain events. Those events constitute a unique history.
There is no basis for speculating as to what would have happened if, at
the outset, Billy had been replaced on the class rolls by Milly. What went
on in the class may have been influenced by the synergism between Billy and
certain others; to have enrolled Milly instead would have made the class
a different experimental "subject". The class containing Milly might have
been drawn under the grouping rule, but it is a distinct class and only one
of numerous alternatives to the class observed. The model does not allow for
a close family tie of the class with Billy to particular other classes in the
population, except as classes may be blocked a posteriori on selected
aggregate and global variables.

With students fixed, effects within the classroom have to be looked
upon as historical accounts of the consequences of bringing together this
set of students, this teacher, and whatever unpredictable events affected
the group. The unforeseeable variability in delivery of the treatment, in
classroom morale, in epidemic illness, etc., is a part of the causal history.
As a thought experiment one can ask what would have happened if this same
collective had gone through the specified treatment several times
independently. That is, one can be conscious that fortuitous events played
a role in determining the history and the scores. But there is no satisfactory
way to assess such variability. The one empirical approach is to treat
successive units of instruction in the class as independent events, but
even if the topics are unrelated, the first experience is likely to influence
the second. It is practicable to generalize over the universe of
observations of the outcome, but that is a side issue here.

Collectives nested within local populations. The alternative
recognizes the division of the population of members into what can
conveniently be called local populations. The grouping rule determines the
membership of each local population. Random definition of local populations is
unlikely. Each local population is a subpopulation of the population of
collectives defined by the grouping rule. It consists of all the collectives
belonging to a certain locality, i.e., all the "remedial" sophomore English
classes that might be formed in this school during a 10-year period. Here
collectives are simply nested within localities. One might think of
crossing localities with time periods and identifying a place-time combination
with a subpopulation.

Student bodies, neighborhoods, and classes are not formed randomly, as is
evidenced by the usual intraclass correlations on initial characteristics. But
it seems not too unrealistic to assume that any one collective within a local
population is a random sample from a set of collectives that might have taken
[illegible]. When local populations of classes are identified with teachers,
where each teacher has many potential classes, I allow the intrateacher
correlation on an initial variable to depart from zero, and assume that the intraclass
correlation within a teacher fluctuates around chance expectancy.

Even with this model, sometimes it is appropriate to regard the
collective observed as having a fixed membership and generating a unique
history. Then one could evaluate the relevance of collective-level statistics
to the subpopulation only by observing two or more
collectives from that subpopulation.

The independence assumption

When the investigator chooses instead to regard the members as random,
he is making two assumptions. He is assuming that one member of the local 
population has as good a chance to fall into the sampled collective as 
another, which seems plausible. Second, he is assuming that as the events 
of the treatment period unfolded, each member's history and performance 
developed independently of the experiences and acts of his classmates. 
This seems more likely to fit the facts of individual instruction than of 
group instruction. But let me be more precise. 

I elaborate on the model of Section 3. Assume a population of
subpopulations for which there is a single $\beta_b$ and $\beta_w$. There is a
corresponding population of values of $\mu_{X_c}$, $\mu_{Y_c}$, and
$\beta_c - \beta_w$, one value of each for each subpopulation. Here,
$\mu_{X_c}$ is the expected value of $X_{pc}$ over members of subpopulation c.
Second, there is a grand population of deviation scores for members,
$X_{pc} - \mu_{X_c}$ and $Y_{pc} - \mu_{Y_c}$ [...]. The variances
of the means are a function of the intraclass correlations.

With this model of the population, the logical design is to sample
local populations and then, to represent those chosen, to sample one or
more collectives. The calculated means and $\beta_c - \beta_w$ for a particular
sampled collective estimate corresponding parameters for the subpopulation.

To evaluate the sampling error when only one collective per subpopulation
has been observed, one uses the member as unit of sampling and estimates
the variability of the mean or regression coefficient from the
within-collective variance.

This amounts to viewing the members as independent instruments for observing an
effect, an effect that is associated with the subpopulation no matter which members
constitute the collective. The obvious example is a teacher effect. The
teacher may be supposed to generate an effect of a certain size in every class,
by virtue of excellent (or poor) technique. Individuals affect the mean
on Y through their aptitudes, but if the model is properly specified
that contribution is separated from the teacher effect. Likewise, the teacher may
adopt some tactic, such as fostering competition, that generates the
same $\beta_c - \beta_w$ in every class he teaches. The independence assumption fits well with
some conceptions of teacher effects, school effects, and context effects
generally.

It is not hard to generate counterexamples, starting with contagion 
effects. Most teachers have the impression that classes develop their 
own "personalities" - responsive, recalcitrant, mutually supportive, 
divergent, etc. This implies a variability across classes within the 
subpopulation much greater than one would estimate from the within-class 
variation. On the other hand, consider the teacher who "grades on a 
curve", so that every class has almost the same final rating. Then the 
variation of mean ratings across classes will be much less than one would 
estimate from variation within the class. 

Choice among models. The first summary comment to make is that, for
many research purposes, inference regarding members of collectives is of
secondary importance at most. In compensatory education, the chief
question is whether, on the average over (presumably) districts, one
policy is more profitable than a competing policy. An experiment on
instructional method usually seeks a conclusion that can be generalized
over classes or possibly teachers. For such purposes, the collective is
the unit of decision making or of theory, and it seemingly should be the
primary unit of sampling, assignment, and analysis. In such a context,
however, an investigator might appropriately make supplementary studies
of classes, asking why some have large means or
large regression coefficients. At this point, he does face a choice between
regarding the class statistics as representing a fixed history or as representing
the independent histories of its members.

To think of members as random and independent appears to require, 
first, an identification of local populations. To simply say "pupils are
regarded as random" (for example) is to make a deniable assumption of zero 
intraclass correlation. In the kinds of studies this report discusses, 
local populations are readily conceived, so that is no barrier. The point 
is primarily important in stressing that the variance over members in 
pooled collectives is not a proper basis for inference; the model directs 
attention to variance of members within collectives. Inference based on 
this variance has to do with parameters within local populations, not with 
inference over the undifferentiated grand population. 

Second, a substantive decision is required as to the legitimacy of 
the independence assumption. Bowers might want to set confidence limits 
on the mean proportion cheating in College A, over its subpopulation of 
successive student bodies. He surely would not regard students as 
independent; his very hypothesis regarding the context effect seems to 
imply a positive feedback loop among the members. Some years, then, are 
likely to see more cheating than others, and within-year variance of
students would tend to give a conservative confidence interval. An
investigator evaluating programmed instructional materials might be convinced
that students react to the materials independently, so that each one earns
about the same score as he would have if taught alongside other classmates.
Then he can contentedly regard students as random while estimating the
extent to which certain teachers get superior results (random within the
subpopulation for the teacher, however). Third, think of research like
Barker's, where the local population is the community and the variable of
interest is number of responsible tasks undertaken in the community. The
variability within the cohort will give too wide a confidence interval for
the community mean. The number of roles to be filled is finite, hence the
mean over cohorts must be quite stable; yet there is a large variance within
the cohort.

An investigator who has only one sample from a subpopulation and
wants to infer to the subpopulation must develop a substantive argument
about the direction and amount of bias the independence assumption entails.
He may go on to base an inference on this shaky assumption,
with appropriate caution.

In this report I have chosen to regard members as fixed within
collectives. This is a conservative position that limits the number of
issues I have to deal with in each particular example.

Weights that define parameters

Population parameters have to be defined with due regard to the number 
of members per collective, when this is not constant. Parameters may 
weight by the number of members or may weight collectives equally. This 
requires a conscious choice of the parameters to be estimated. To be sure, 
a person who is interested in the weighted mean for school districts may 
use the unweighted mean as an estimator, assuming that the correlation of
the variable with district size is negligible. But the fundamental question 
is what mean (or variance, regression coefficient, etc.) he would like to 
evaluate. Sometimes the decision can be reduced to a theoretical question 
and sometimes it is a question of utility. The same weighting should, I 
think, be used in defining all the parameters of the study, to avoid 
numerical inconsistencies. 

I am inclined to think that in most instructional research weighting 
pupils equally is the preferred way to define parameters. It is doctrine 
in our society that individuals are equally important, and in any ultimate 
policy decision the burden of proof is on whoever proposes to weigh pupil 
interests unequally. That is, if it should happen that the mean effect of 
a treatment is positive when calculations are weighted by class size, and 
negative when unweighted, "the good of the greatest number" would favor 
use of the treatment. Weighting classes by $n_c$ weights individuals equally.
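
The reversal described above is easy to produce with made-up numbers. The sketch below is only an illustration: the class effects and class sizes are hypothetical, not taken from any study in this report, and the function names are ours.

```python
# Hypothetical class-level treatment effects and class sizes (not from the report).
effects = [4.0, 3.0, -2.0, -3.0, -2.5]   # mean effect observed in each class
sizes = [40, 35, 12, 10, 8]              # n_c, the number of pupils in each class

def unweighted_mean(values):
    """Each class counts once, whatever its size."""
    return sum(values) / len(values)

def pupil_weighted_mean(values, weights):
    """Each class counts n_c times, so that each pupil counts once."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

class_mean = unweighted_mean(effects)
pupil_mean = pupil_weighted_mean(effects, sizes)

print(f"classes weighted equally: {class_mean:+.2f}")
print(f"pupils weighted equally:  {pupil_mean:+.2f}")
```

With these numbers the two large classes benefit and the three small ones do not, so the sign of the mean effect depends on which parameter has been defined.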

Theory may give a reason for weighting on a principle other than "one 
man, one vote". In research on factors influencing national returns for 
Senate seats, the fact that each State has two Senatorial votes might 
argue for using unweighted State means. This, it should be noted, arises not 
from a statistical principle but from a substantive context in which States 
are equivalent in weight. Weighting groups equally can be appropriate in 
education also. One example is an evaluation study with
wear-and-tear-on-the-teacher as dependent variable. Teachers, in that instance, are the
ones with equal rights.

When the State of California wants to examine the mean of student
achievement, it might count districts, or schools, or pupils equally. It
seems obvious that pupils are the correct unit. If a change in distribution
of tax revenues depresses the school program in 20 large cities while
improving the program in the 500 smallest districts, the effect on the 
welfare of pupils qua pupils probably is negative on balance. Surely the 
legislature is just as concerned with the welfare of the typical city child 
as with that of the typical child in a small community. 

With regard to the regression of district mean outcome on background
characteristics, if district size makes much difference there probably
should be a separate regression in each size category. But if only one
regression is to be used, the pupil-weighted regression for district means
seems to give the best statement as to the "normal" district-level outcome
corresponding to given background characteristics.

Consistently weighted calculations produce harmonious 
numbers at several levels. For example, the pupil-weighted sum of squares 
between districts and the sum of pupil-weighted within-district sums of 
squares for schools add up to the pupil-weighted sum of squares for 
schools pooled. Inverse weighting is equally possible. If one wanted to 
weight districts equally in district-level calculations, it would be 
possible in school-level calculations to weight each school in inverse proportion to
the number of schools in its district.
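
The additivity claimed for consistently weighted sums of squares can be checked numerically. The following is a minimal sketch with hypothetical districts, schools, and school means; the data and function names are assumptions made for illustration only.

```python
# Hypothetical data: district -> list of (pupils in school, school mean).
districts = {
    "A": [(300, 52.0), (500, 48.0), (200, 55.0)],
    "B": [(150, 61.0), (250, 58.0)],
    "C": [(400, 45.0)],
}

def weighted_mean(pairs):
    """Pupil-weighted mean of a list of (n, mean) pairs."""
    total = sum(n for n, _ in pairs)
    return sum(n * m for n, m in pairs) / total

def weighted_ss(pairs, center):
    """Pupil-weighted sum of squared deviations of means from a center."""
    return sum(n * (m - center) ** 2 for n, m in pairs)

all_schools = [school for schools in districts.values() for school in schools]
grand_mean = weighted_mean(all_schools)

# Pupil-weighted sum of squares for schools pooled over all districts.
ss_schools_pooled = weighted_ss(all_schools, grand_mean)

ss_between = 0.0   # pupil-weighted sum of squares between districts
ss_within = 0.0    # sum of pupil-weighted within-district sums of squares
for schools in districts.values():
    n_district = sum(n for n, _ in schools)
    district_mean = weighted_mean(schools)
    ss_between += n_district * (district_mean - grand_mean) ** 2
    ss_within += weighted_ss(schools, district_mean)

print(ss_between + ss_within, ss_schools_pooled)
```

The two printed values agree apart from rounding error. If the district means were instead weighted equally while the schools stayed pupil-weighted, the identity would in general fail, which is the inconsistency the text warns against.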

The weighting that defines a parameter may not be the weighting used
in making estimates, particularly if the sampling fraction varies with the
collective. One might, in California, sample schools in large districts
while collecting data in every school in the small districts. This would
lead one to weight schools unequally in calculations over districts.

In Bowers' study, data on attitudes and conduct were collected in 93
colleges. The same number of students were sent questionnaires in each college,
though returns were not uniform. The sample sizes were not at
all proportional to the college enrollments. The population of interest
could be defined by

a. counting respondents equally, or

b. counting colleges equally (which would call for weighting
each sample mean equally and in individual-level calculations weighting the
data inversely by the size of the sample for the person's college), or

c. counting individuals in the national student population
equally (which would call for weighting each datum by the ratio of college
enrollment to sample size for that college).

Bowers, it will be recalled,
was concerned with the relation of behavior to the dominant opinion in the
student body. This is described in the between-colleges regression and in
the mean of the within-college regressions. Option (a), which Bowers
and others used in their calculations, seems not to be the soundest
choice. The resulting statistics refer to no population save that
constituted by the sampling procedure. Options (b) and (c) could give disparate
results if means for large and small colleges are not distributed around the same
regression line, or if their within-college regressions differ
systematically. If the large colleges exhibit a positive trend and, arguendo, the
small ones exhibit a negative trend, these can balance out in the unweighted
calculation whereas the large-group trend will dominate the weighted
calculation. I believe that Bowers would be interested in a trend whether
it appears in the weighted or the unweighted calculation. In investigations
like this, as in the California data, it appears important to learn how
regressions vary with group size (Note 1).
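
The three definitions (a), (b), and (c) listed above can be written out as weights. The sketch below uses three hypothetical colleges; the enrollments, returns, and means are invented for illustration and are not Bowers' data.

```python
# Hypothetical colleges: enrollment, number of returned questionnaires,
# and the sample mean of some attitude score.
colleges = [
    {"enrollment": 20000, "returns": 80, "mean": 3.0},
    {"enrollment": 1500, "returns": 90, "mean": 4.0},
    {"enrollment": 600, "returns": 60, "mean": 4.4},
]

def weighted_mean(rows, weight):
    """Mean of the college means, each weighted by weight(row)."""
    total = sum(weight(r) for r in rows)
    return sum(weight(r) * r["mean"] for r in rows) / total

# (a) respondents counted equally: a college mean carries weight = its returns.
mean_a = weighted_mean(colleges, lambda r: r["returns"])

# (b) colleges counted equally: every college mean carries the same weight.
mean_b = weighted_mean(colleges, lambda r: 1.0)

# (c) students in the national population counted equally: each datum gets
#     enrollment / returns, so a college mean carries weight = enrollment.
mean_c = weighted_mean(colleges, lambda r: r["enrollment"])

# Per-datum weights for individual-level calculations under (b) and (c).
for r in colleges:
    r["datum_weight_b"] = 1.0 / r["returns"]
    r["datum_weight_c"] = r["enrollment"] / r["returns"]

print(round(mean_a, 3), round(mean_b, 3), round(mean_c, 3))
```

Because the large college differs from the small ones in this example, option (c) is pulled toward it while (a) and (b) are not; when college means are unrelated to size the three definitions nearly coincide.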

If Bowers were to decide that size was not systematically related to
the effects of interest, he might want to take the precision of his
information into account in estimating the relationships. If student bodies are
much larger than his samples, the standard error of each college mean is
nearly inversely proportional to the square root of the sample size for it,
and the means could be weighted by that factor in the between-groups
analysis. The same weighting could be used in averaging the within-college
regression coefficients.

Illustrative statistics for populations of collectives

The population of collectives, I have said, is characterized by a
number of parameters at the level of the collective. Two examples will
give concreteness to the idea.

Head Start. Smith and Bissell (1970) give correlations, means, and
s.d.'s for a set of demographic variables and a posttest (Metropolitan
Reading Readiness) on 202 Head Start children in 26 centers. The entries
in Table 4.1 are calculated from their report. As the data come from a
sample of centers they describe the reference population only approximately.
The covariances of the initial variables are as much a part of the population
definition as are the variances. In fact, Smith and Bissell described the
data in such detail in order to point out that the Head Start group matched
the control group poorly. Even though the means and standard deviations
matched fairly well, the correlations were consistently stronger in the
control group. POPED and NKIDS, for example, have a covariance of -0.39 in
the Head Start sample, and -0.77 in the control sample (between centers).

Table 4.1. Statistics describing a sample of Head Start centers

                                        Between-centers variances and covariances   Intraclass    b with Reading
Variable                        Mean    POPED   POPINC   POPOCC   NKIDS      MR     correlation   (between centers)

Father's education (POPED)       2.3     .36                                           .31             3.47
Father's income (POPINC)         2.6     .16     .49                                   .43             6.06
Father's occupation (POPOCC)     1.0     .07     .20      .25                          .20             6.72
Children in family (NKIDS)       4.8    -.39    -.02      .02     1.0                  .20             -.72
Metropolitan Reading (MR)       52.2    1.25    2.96     1.68    -.72     75.69        .29

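
The last column of Table 4.1 can be related to the covariance entries: a between-centers regression coefficient for Reading on a single predictor is the covariance of the predictor with Reading divided by the predictor's variance. The sketch below recomputes the coefficients from the table entries as we read them; the negative sign for NKIDS is an assumption taken from the sign of its covariance with Reading, and small discrepancies presumably reflect rounding in the printed entries.

```python
# (between-centers variance, covariance with MR, printed b) as read from Table 4.1.
# The sign of the NKIDS coefficient is assumed from the sign of its covariance.
table = {
    "POPED": (0.36, 1.25, 3.47),
    "POPINC": (0.49, 2.96, 6.06),
    "POPOCC": (0.25, 1.68, 6.72),
    "NKIDS": (1.00, -0.72, -0.72),
}

def slope(variance, covariance):
    """Simple regression slope of the outcome on one predictor."""
    return covariance / variance

recomputed = {name: slope(var, cov) for name, (var, cov, _) in table.items()}

for name, (_, _, printed) in table.items():
    print(f"{name:7s} recomputed {recomputed[name]:6.2f}   printed {printed:6.2f}")
```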

School districts in California. Another example comes from the
California Assessment Program. Every student in certain grades is tested
each year. Rogosa and I have analyzed data for 882 districts (4514 schools);
this is not the entire population, since we confined the analysis to schools
for which information was available on each of the variables under
consideration. These were:

ELT 3. A readiness test given to first-graders entering in 1973.
ELT 4. A similar test given to first-graders entering in 1974.
Rdg.   A reading test at end of third grade, given in 1975 to
       students most of whom entered in 1972.
SES.   An estimate based on teacher's report of father's
       occupation for each third-grader.
Mob.   Principal's estimate of per cent mobility for the school.
Bil.   Teacher's estimate that the pupil was or was not bilingual.

Calculations can be made directly from school means and from district
means, but in the district-level calculations we weighted by number of
schools. (In retrospect, we had better grounds for weighting by number of
pupils.)

Table 4.2 gives results for all districts except those having just one
school. The correlations are large for all variables except mobility. The 
intraclass correlations are larger than in the Head Start data and, except 
for mobility, remarkably uniform. 

Even in this weighted calculation the elimination of the one-school
districts had a large effect. The ELT-3 vs. ELT-4 correlation dropped from
0.95 to 0.84. No interclass correlation increased. The standard deviations
for schools did not change but those for districts increased. Consequently,
the intraclass correlations rose to about 0.50 (0.40 for mobility). To
analyze the one-school districts separately from the remainder makes sense;
and in the population of larger districts it is advisable to check that
relations of interest do not vary with district size.

Table 4.2. Population parameters for California districts with more than one
school. All calculations weighted by number of schools per district.

                     s.d.,     s.d. for     Between-districts correlations         Intraclass
Variable    Mean     schools   districts    ELT 3   ELT 4    SES     Mob     Bil   correlation
                     pooled

ELT 3      29.04      2.18       1.41                                                  .42
ELT 4      27.45      2.49       1.64        .95                                       .43
SES         2.16      0.41       0.27        .79     .80                               .43
Mob        39.56     11.97       6.18       -.16    -.16    -.28                       .27
Bil         0.18      0.19       0.13       -.75    -.74    -.61    -.02               .47
Rdg        82.34      9.18       6.02        .89     .89     .81    -.23    -.67       .43
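
As a check on the reading of Table 4.2, the entries in its final column agree, to two decimals, with the squared ratio of the district-level s.d. to the s.d. for schools pooled, that is, with the proportion of school-level variance lying between districts. The sketch below simply redoes that arithmetic on the printed values; interpreting the column this way is our inference from the numbers, not a statement made in the table itself.

```python
# (s.d. for schools pooled, s.d. for districts, printed intraclass value)
# as read from Table 4.2.
rows = {
    "ELT 3": (2.18, 1.41, 0.42),
    "ELT 4": (2.49, 1.64, 0.43),
    "SES": (0.41, 0.27, 0.43),
    "Mob": (11.97, 6.18, 0.27),
    "Bil": (0.19, 0.13, 0.47),
    "Rdg": (9.18, 6.02, 0.43),
}

def variance_ratio(sd_schools, sd_districts):
    """Between-district variance as a fraction of the pooled school variance."""
    return (sd_districts / sd_schools) ** 2

ratios = {name: variance_ratio(s, d) for name, (s, d, _) in rows.items()}

for name, (_, _, printed) in rows.items():
    print(f"{name:6s} ratio {ratios[name]:.3f}   printed {printed:.2f}")
```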

Los Angeles (441 schools) is four times as large as the next largest
district, which led us to wonder whether Los Angeles alone had an appreciable
influence. The correlations with Los Angeles omitted departed little from
those in Table 4.2. The s.d.'s for schools decreased by about 10 per cent
(except for SES and mobility) and [...] increased. Removing Los Angeles had
little effect on most of the statistics because its mean was close to the
State mean. Had it been an outlier on any variable the changes would have
been great.

A problem of estimation 

If one wishes only to describe relations in the sample of groups and 
individuals before him, it is unnecessary to speak of "estimation". Calcu- 
lation simply requires attention to the definition of the various components 
and parameters, with respect to such matters as weighting. I postpone 
most problems of inference to Section 7. One point needs to be made here, 
however, to prepare the reader for the erratic behavior of the between-groups 
coefficients to be encountered in Section 5. 

A regression coefficient is determined largely by the cases toward the 
extremes on the predictor variable. Those cases "have leverage" on the 
slope of the regression line, just as do persons perched on the end of the 
seesaw. Cases near the mean - the fulcrum - have little influence 
on the slope. This means that the "effective" sample size determining 
a regression coefficient is much less than the number of sampling units. 
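
The seesaw image can be made concrete. A least-squares slope is a weighted average of the slopes of the lines joining each case to the centroid, with weights proportional to squared distance from the mean on the predictor. The sketch below uses hypothetical data; the "effective n" it prints is one conventional index (the reciprocal of the sum of squared weights), introduced here only for illustration and not defined in this report.

```python
# Hypothetical data: nine cases, two of them far out on the predictor.
x = [-120, -40, -15, -5, 0, 5, 15, 40, 120]
y = [-90, -20, -25, 5, 0, -10, 20, 35, 70]

def mean(values):
    return sum(values) / len(values)

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def leverage_weights(x):
    """Share of the predictor sum of squares contributed by each case."""
    mx = mean(x)
    squares = [(xi - mx) ** 2 for xi in x]
    total = sum(squares)
    return [s / total for s in squares]

weights = leverage_weights(x)
mx, my = mean(x), mean(y)

# Weighted average of case-to-centroid slopes; a case exactly at the mean
# has weight zero and is skipped to avoid dividing by zero.
slope_from_weights = sum(
    w * (yi - my) / (xi - mx)
    for w, xi, yi in zip(weights, x, y)
    if xi != mx
)

extreme_share = weights[0] + weights[-1]          # weight of the two end cases
effective_n = 1.0 / sum(w * w for w in weights)   # illustrative index only

print(ols_slope(x, y), slope_from_weights)
print(extreme_share, effective_n)
```

In this example the two end cases carry most of the weight, so the slope behaves as if it rested on far fewer than nine observations.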

Note 1. We made some limited comparisons in some of the Bowers data and found
that regressions were similar whether colleges or individuals in the
example were weighted equally. We did not apply weighting of type (c).

5. Illustrative ATI studies

The Anderson study

Before considering theory further, I turn to a number of
illustrative studies, beginning with G. Anderson's 1939 data. Webb and
I reanalyzed that study because the ATI effect it reported
has been of considerable interest. A full account of the design of the
study and of our reanalysis appears in the Cronbach-Webb (1975) paper, so I can
be brief here. Data on 9 classes in Treatment A and
8 in Treatment B are available. The classes were taught the same
year-long arithmetic curriculum, the A's by a method that emphasized the
meanings of the processes, and the B's by a drill method, with little meaning
being developed. Teachers were assigned to the method most like their
usual style, not randomly. The students in each class were those assigned
to that teacher by the school's routine procedure. The study is a
quasi-experiment. One can reasonably generalize from the A data to the population
of teachers likely to opt for a meaningful method (in the schools of the
late 1930's). I prefer not to regard the A and B teachers as
samples from the same population. The classes may well be random
samples from a single population of classes, but classes within
treatments differed in ability level.

Among Anderson's many pretest scores, we found it sufficient
to use just two, which we label ABILITY and PRECOM. The former
is a conventional group mental test rescaled to have mean zero and s.d. 100
over all cases pooled. The latter is the total score on the Compass
achievement test in arithmetic computation, at the time of pretest. It too
was put on the 0,100 standard-score scale. The dependent variable (ZACH)
was a similarly scaled composite of subtests from the Compass posttest and
from the Analytic Skills of Attainment. Rescaling makes it easier to
compare regression coefficients for variables with different metrics.
These will not be standardized regression coefficients. The s.d. of each
variable varies with the group.

Webb and I did not use the single stepwise regression analysis suggested
on p. 3.2, because it is a comparatively awkward way to arrive at
descriptive statistics for separate classes. Instead, we carried out
separate regression analyses within treatments and within each class.
This costs more in computer time than a single generalized analysis, but
the ease of interpretation saves investigator time. The procedure does
not, however, generate inferential statistics on the treatment contrast.

A weighting decision. To evaluate
$\beta_b$ and other between-group statistics within a treatment one has these
options:

1. Calculate $\bar{X}_c$ and $\bar{Y}_c$ for each group. Enter these
k pairs in the computations. Or

2. Carry out the computations but weight each pair
($\bar{X}_c$, $\bar{Y}_c$) by the corresponding $n_c$.

In the model Webb and I used $n_c$ as a weight in defining parameters.
Weighted calculations from the sample give unbiased estimates of the
weighted parameters and the unweighted calculations in general do not.
Regressions of ZACH on PRECOM. Within each treatment,
regression analyses were made with the group mean on PRECOM as
predictor and with the individual's deviation score as a predictor. I
ignore the constant terms, which are of no immediate interest in
this report; the treatment means did not vary greatly. The
unstandardized regression coefficients were as follows:

                       Between classes    Within classes

Drill treatment             0.74               0.73
Meaning treatment           0.47               0.71

The difference in the between-classes coefficient is large enough to 
be of potential theoretical and practical interest if taken at face 
value. Apparently, differences in X means produce comparatively 
large differences in outcome means of drill classes. One can ration- 
alize this by hypothesizing that when an able class shows good results 
on drills the teacher steps up the pace and covers more topics or more 
variants within a topic. Increasing (or reducing) the amount of work 
covered is comparatively easy in a drill class. Practically, this 
difference in coefficients coupled with a near-zero difference in 
overall means suggests the hypothesis that the drill method is best 
for classes formed of high-PRECOM students, and the meaning method 
best for low-PRECOM classes; but this would require verification 
on classes formed in that way.
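
The arithmetic behind that hypothesis can be written out. This is a sketch under the stated simplification that the two constant terms, written here as $a_D$ and $a_M$ (symbols introduced only for this illustration), are nearly equal at a PRECOM class mean of zero:

```latex
\hat{Y}_{\mathrm{drill}} - \hat{Y}_{\mathrm{meaning}}
  = (a_D - a_M) + (0.74 - 0.47)\,\bar{X}_c
  \approx 0.27\,\bar{X}_c
```

On this reading a class whose PRECOM mean is +50 would be expected to score about 13.5 points higher under drill, and a class whose mean is -50 about 13.5 points higher under meaning, taking the between-classes coefficients at face value.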

A between-classes coefficient in a small study cannot be trusted.
Even though Anderson's study was large by conventional
standards (over 400 students), only 8 or 9 classes contributed to
each regression equation. A difference in coefficients much greater
than 0.27 (in this metric) would fall short of significance with such
a sample. When we plotted the means (see figure) the two sets of
points seemed to lie within the same distribution. Coefficients are
most strongly influenced by data points at the extremes of the X scale.
At the right end the extreme points in the two treatments are close
together. At the left end, the extreme point for drill pulls
its slope down, whereas the extreme
point for meaning is very little below zero on the Y scale. This
alone seems to produce the difference in final slopes.

For a more formal consideration of statistical inference, see Section 7. 

The within-class coefficients are almost exactly equal. Taking 
the coefficients at face value, the two coefficients for drill are 
the same, which is consistent with the view that individual aptitude 
determines performance and context effects are lacking. These data, 
however, give no basis for ruling out the hypothesis that if students 
were taught individually the overall slope would become much flatter 
(no systematic adjustment of the pace to ability) or much steeper 
(students truly moving at their own rate). The fact that the
between-class slope is smaller than the within-classes slope for meaning would
invite other speculative interpretations, for example, that the
comparatively able members of a class drive the level of discussion
up to the point where the less able become confused. All such
interpretations become moot when the uncertainty attached to the
between-groups coefficients and the possibility of demographic effects are borne in mind.

Figure 5.1. Plot of class means in the Anderson study
(From Cronbach & Webb, 1975)

Along with the coefficients within single classes, the following
array gives the class means on PRECOM in parentheses.

Drill:     1.07    .93    .90    .90    .84    .63    .56    .55     .51
           (34)  (-45)   (38)   (17)   (-7)    (6)  (-13)   (81)  (-118)

Meaning:   1.12   1.08    .87    .76    .73    .67    .50    .43
           (-1)   (-7)   (38)   (23)  (-75)   (-8)  (108)   (36)

These distributions of b's do not differ. The weighted averages are
0.75 and 0.78 respectively.

An investigator gathering data such as these today would be wise to
ask why some coefficients are twice as large as others. A coefficient
is an historical fact about a certain group of
identified students, going through a unique series of local events
that realized in a specific way an intended treatment plan. It is
as legitimate to contrast high- and low-slope classes as it is for
the historian to contrast, say, Utopian settlements that succeeded
with Utopias that failed.

In the drill classes, the $b_c$'s are positively related to the class
means (r = 0.28). The low-slope outlier whose mean is -118 contributes
so much to the correlation that it probably should not be taken seriously.
In the sample of meaning classes, the correlation is -0.34; the
outlier whose mean is -75 weakens an otherwise-strong negative trend. When
the sample of classes is of the normal size, trends such as these will
never be convincingly established as characteristic of the population
of classes. Nonetheless, the analysis is a reasonable step in learning
from the data.

The variance of ZACH scores within treatments was broken down as
follows (all figures are percentages):

                                              Drill     Meaning

Between classes                                21.5       14.9
    Regression                                 16.5        8.0
    Residual                                    5.0        6.9
Within classes                                 78.5       85.1
    Regression                                 51.5       57.3
    Regression, $\beta_c - \beta_w$             4.1        5.6
    Residual                                   22.9       22.2
Individuals (overall)                         100.0      100.0


These values are pretty much what one would expect: within-class
differences account for more variance than between-class differences,
and the predictable variance is larger than the unpredictable
variance. The specific-within-class regressions do not account for
much variance. Comparatively little of the between-class variance in
outcome under the meaning treatment was predicted; this is consistent
with the slope reported earlier.

It is well to keep absolute
magnitudes of effects in mind. (In the Anderson data, the ZACH
variances within Drill and within Meaning were 9880 and 10011
respectively.) A between-classes effect, for example, should
not be dismissed as unimportant merely because it is small relative
to the scale of individual differences; at some point, one must
consider the meaning in absolute terms. When a test is the dependent
variable, what is "important" is judged on the basis of the absolute
proficiency required to earn various scores.
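
One way to keep absolute magnitudes in view is to convert the percentage breakdown into variance components on the ZACH scale. The sketch below applies the reported percentages to the reported within-treatment variances (9880 and 10011); the dictionary labels are ours, and the square roots are offered only as a rough sense of scale in ZACH points.

```python
import math

# Percentages of ZACH variance as reported in the breakdown above.
percent = {
    "Drill": {
        "between: regression": 16.5,
        "between: residual": 5.0,
        "within: regression": 51.5,
        "within: specific regressions": 4.1,
        "within: residual": 22.9,
    },
    "Meaning": {
        "between: regression": 8.0,
        "between: residual": 6.9,
        "within: regression": 57.3,
        "within: specific regressions": 5.6,
        "within: residual": 22.2,
    },
}
total_variance = {"Drill": 9880.0, "Meaning": 10011.0}

# Absolute variance components on the ZACH scale.
absolute = {
    treatment: {
        label: pct / 100.0 * total_variance[treatment]
        for label, pct in parts.items()
    }
    for treatment, parts in percent.items()
}

for treatment, parts in absolute.items():
    for label, value in parts.items():
        print(f"{treatment:8s} {label:30s} variance {value:8.1f}  s.d. {math.sqrt(value):5.1f}")
```

A component that is a small percentage of the total can still correspond to a spread of many score points among class means.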

Regressions of ZACH on ABIL. Anderson's original analyses were
bivariate, at the individual level. His regression
planes relating achievement to ABILITY and PRECOM had different slopes.
Drill appeared to generate better
achievement for students with high PRECOM and comparatively low
ABILITY whereas meaning gave better results for those with the reverse
pattern ("underachievers"). Before making this calculation, Anderson
removed a subset of superior classes from the sample. Webb and I
retained all classes in our calculations.

Instead of analyzing ABILITY we formed a variate ABIL defined as
the value of ABILITY - 0.47 PRECOM, restandardized to a 0,100 scale.
Since 0.47 was the overall
regression coefficient relating ABILITY to PRECOM, ABIL and PRECOM
have little redundancy at the individual level. To have used ABILITY
as a predictor in a univariate analysis would echo so much of the
information in PRECOM as to obscure the interpretation.
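
The construction of ABIL is a residualization: subtract from ABILITY the part predicted by PRECOM, then rescale. The sketch below reproduces the idea on simulated scores; the data are invented, the function names are ours, and the slope is estimated from the simulated sample rather than fixed at the report's 0.47.

```python
import random
import statistics

random.seed(0)

# Simulated scores (hypothetical): ABILITY is built to overlap with PRECOM.
precom = [random.gauss(0, 100) for _ in range(500)]
ability = [0.47 * p + random.gauss(0, 88) for p in precom]

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def correlation(x, y):
    """Pearson correlation of x and y."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

def rescale(values, sd=100.0):
    """Put values on a scale with mean zero and the given s.d."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(v - m) / s * sd for v in values]

b = slope(precom, ability)   # plays the role of the report's 0.47
abil = rescale([a - b * p for a, p in zip(ability, precom)])

r_before = correlation(ability, precom)
r_after = correlation(abil, precom)
print(b, r_before, r_after)
```

When the coefficient is the overall least-squares slope, the residual variate is uncorrelated with PRECOM over the pooled cases. That does not guarantee a zero correlation among class means, a point that matters for the between-classes analysis.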

The unstandardized regression coefficients onto ABIL were as follows:

                       Between classes    Within classes

Drill treatment            -0.20               0.39
Meaning treatment           0.31               0.52

When PRECOM was the predictor, the Anderson finding had led us to 
anticipate larger coefficients in the drill treatment. This was true only 
of the between-classes coefficient, and we have dismissed that finding as 
untrustworthy. Anderson led us to anticipate smaller coefficients in 
the drill treatment with ABIL as predictor, and again the principal 
difference appeared between classes. The difference is impressively 
large — but is it worthy of serious consideration? 

The plot of group means again suggests that the two sets of 
means have the same distribution, in that range of ABIL 
where both treatments appear. The negative slope in drill would turn 
slightly positive if one class at the upper right were discarded. The 
salient feature of the plot, however, is the narrow range of ABIL 
means in drill classes. 

Tracing this back, we found that across drill classes ABILITY
and PRECOM were highly correlated (0.74), but the correlation was 
near zero (0.09) across the meaning classes. The drill-class means 
in ABILITY were largely redundant with PRECOM. 

Consequently, there was little variance in the second dimension of the 
between-class predictor distribution. The small variance in ABIL across 
drill classes meant that the between-class slope onto ABIL is almost worthless 
as an estimate of the population regression. 
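
Why a narrow predictor range makes a slope "almost worthless" can be seen from the textbook standard error of a simple-regression slope. The sketch below is illustrative only: the function name is hypothetical and the numbers are invented, not taken from the Anderson data.

```python
import math

def slope_standard_error(residual_sd, predictor_sd, n_units):
    """Textbook standard error of a simple-regression slope.

    SE(b) = residual_sd / (predictor_sd * sqrt(n_units)),
    with predictor_sd computed using divisor n.
    """
    return residual_sd / (predictor_sd * math.sqrt(n_units))

# Hypothetical figures: same residual scatter and the same number of
# classes, but class means spread widely in one treatment, narrowly in the other.
se_wide = slope_standard_error(residual_sd=30.0, predictor_sd=50.0, n_units=10)
se_narrow = slope_standard_error(residual_sd=30.0, predictor_sd=10.0, n_units=10)
ratio = se_narrow / se_wide
```

Shrinking the spread of class means by a factor of five inflates the standard error of the between-class slope by the same factor, with the sample size unchanged.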
Anderson's study foundered on an accident of sampling. The
classes in the two treatments appeared comparable to him, since the 
univariate between-class distributions on ABILITY and PRECOM were 
similar. Also, the bivariate distributions for individuals pooled 
looked much the same. Anderson failed to inspect the bivariate 
distributions of class means. The points in the drill distribution 
lie nearly in a straight line, whereas the meaning distribution is 
elliptical. A chance failure to assign "off-line" classes to the 
drill treatment spoiled Anderson's chance to get information on the 
bivariate regression. Smith and Bissell, it will be recalled (p. 3.6),
found a similar anomaly in the between-groups covariances of predictors
in the Westinghouse study, even though Head Start and control cases had
supposedly been matched. The Westinghouse investigators evidently
inspected univariate between-center statistics and, like Anderson,
failed to observe the mismatch of the multivariate distribution of
center means for predictors.

In Anderson's data, the slope difference onto ABIL within groups
is too small to be worth interpreting. I shall not pursue further
details of the study.
Cooperative Reading data*

Plan of the studies. The Cooperative Reading study of the mid-1960s
was a forerunner of other "planned variation" studies. To compare a dozen
methods for teaching primary reading, 27 research contracts were let. Each
investigator was to adopt certain features of a standard design, but he
was free to add procedures and to introduce treatments that interested him
alongside the standard treatments. We concentrate on the comparison of
Basal (B) and Language Experience (LE) methods. Each investigator prepared
his own reports, and a composite analysis of all the data was made by
Bond and Dykstra (1967). The reports attracted our attention because
many ATI were reported; a summary of those, prepared mainly by Snow,
appears in Cronbach and Snow, 1976, Chap. 8.

The director of each of the 27 studies selected intact classrooms 
whose teachers agreed to participate in the study and assigned classes to 
treatments. Directors did some matching of teachers across treatments on 
the basis of amount of experience, and on achievement of their students in 
the previous year. In most of the projects teachers ranged widely in 
rated competence. Most teachers were experienced in teaching first-grade
reading using basal readers; few had taught by LE. 

Some project directors matched classes across treatments on the 
basis of student aggregate performance in kindergarten, and on aggregate 
SES. Most projects used students of varying ability. In some projects, 
ethnic backgrounds of classes happened to differ from treatment to 
treatment. In few of the projects comparing B and LE were classes 
randomly assigned to treatments. 

*Noreen Webb is coauthor of this section. 
In the Bond-Dykstra analyses comparing Basal to other methods,
the non-Basal approaches seemed in general to be superior to Basal programs.
Students superior in certain abilities seemed to achieve better in LE
than in B. Less able pupils profited more from B. This relation was not
clearly interpreted, however, since Bond and Dykstra were unable to carry
out a multivariate ATI study using all readiness and aptitude scores
together.

A reanalysis of a subset of the data with sophisticated multivariate
techniques was made by Lo (1973). He reported a significant advantage
for students with high perceptual speed (i.e., high on Identical Forms)
in LE, whereas those low on the scale did better in B. Lo's analysis
pooled classes and projects within treatments.
The original analysis across projects. Four projects compared B
with LE. The B classes usually followed traditional Ginn or Scott-Foresman 
readers. In LE classes pupils told stories; these stories formed reading 
material which incorporated the children's language patterns. The methods 
varied slightly across projects. 

Treatment groups within projects ranged from 219 pupils to 652 
pupils; the number of classes per treatment ranged from 10 to 27. 
Class size varied from 8 to 32. 

Students in all projects were tested in September 
of Grade 1 on the Pintner-Cunningham Intelligence Test and on several 
more-specific variables (e.g., Phonemes, Pattern Copying, Word Meaning).
Five subtests of the Stanford Achievement Test Primary Battery I were 
administered after 140 days of instruction. 

Bond and Dykstra first analyzed in a Sex x Treatment x Project design,
working from the means for boys and girls in each class. The unit of
analysis was thus the half-class mean.

Analysis of variance was performed on each pretest or posttest. Two
analyses of covariance were performed on each posttest, one with Phonemes
and Identical Forms as covariates, the other with all seven pretest
measures as covariates.

Girls scored significantly higher than boys on all pretests.
On 6 out of 7 pretests projects differed significantly. Treatment groups
differed significantly on 4 pretests. On one pretest, a significant
Project x Treatment interaction appeared. These last results strongly
imply that the treatment groups are not random samples from the same
population, though the significance tests are questionable (see below).

Most sex differences at posttest tended to disappear in the covariance
analysis. The few treatment differences found could be attributed to
chance.

Significant differences among projects and Project x Treatment
interaction effects turned up. Because the treatments behaved
differently from one project to another, Bond and Dykstra decided to
analyze within each project.

Half-class as unit of analysis? The Bond-Dykstra anova is exceptional
in its design, and a discussion of it will extend thinking on
units of analysis. The example is so exotic, however, that I give it
little space. Their design and analysis are shown schematically in Figure 5.2.
Three factors are crossed. Male and female halves of a class were taken as the
unit of analysis, with no attention paid to the nesting of halves within classes.

To ignore classes is in effect to assume a priori that the component
of variance associated with classes is zero. This would be self-evident
only if half-classes had been formed and treated separately, and then
arbitrarily paired for the analysis.

Figure 5.2. Scheme of the Bond-Dykstra anova (columns: Effects; Error term)

Figure 5.3. Analysis (columns: Effects; Error term; (b) Half-classes within ijk)

An analysis that more adequately recognizes the pairing of half-classes
looks upon the half-classes as repeated measures on the class.
(This formulation was suggested to me by Dan Davis.) The analysis proceeds
as suggested in Figure 5.3. Since sex is a fixed factor, the error term
for the i, j, and ij effects is (a), the mean square for classes.
The error term (b) applies to the remaining effects.

The error mean square of Bond and Dykstra is closely related to (b),
but it includes variance attributable to class and it claims twice as many
d.f. for error.
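
The doubling of error d.f. can be checked by simple counting. The sketch below assumes a balanced design (the same number of classes in every Project x Treatment cell), which the actual study did not have; the function name and the example figures are hypothetical.

```python
def error_degrees_of_freedom(n_projects, n_treatments, classes_per_cell, n_sexes=2):
    """Count error d.f. under two treatments of the half-class data.

    Assumes a balanced design: classes_per_cell classes in every
    Project x Treatment cell, each split into n_sexes half-classes.
    """
    cells = n_projects * n_treatments
    # (a) classes within Project x Treatment cells
    classes_within_cells = cells * (classes_per_cell - 1)
    # (b) Sex x classes within cells (half-classes as repeated measures)
    sex_by_classes = classes_within_cells * (n_sexes - 1)
    # Half-class taken as an independent unit, nesting in classes ignored
    half_class_as_unit = cells * n_sexes * (classes_per_cell - 1)
    return {
        "a_classes": classes_within_cells,
        "b_sex_by_classes": sex_by_classes,
        "half_class_as_unit": half_class_as_unit,
    }

# Hypothetical balanced layout: 4 projects, 2 treatments, 10 classes per cell.
df = error_degrees_of_freedom(n_projects=4, n_treatments=2, classes_per_cell=10)
```

With two sexes, treating each half-class as an independent unit claims exactly twice the d.f. of error term (b), while also mixing class-to-class variance into the error mean square.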

The original analysis within projects. In the within-project
analysis Bond and Dykstra made the student the unit of analysis in order
to look at ATI effects. The treatment main effect usually favored LE,
though this was reversed in some projects. To study interactions, subjects
were blocked in turn on Pintner (4 levels), Phonemes (3 levels), and
Letter Names (4 levels). In only one project (Stauffer) were many
interactions reported as significant; the child with poorer readiness tended to
achieve more in B, whereas the able child profited more from LE. Bond
and Dykstra dismissed this result, judging that the apparent
interaction effect with one variable might be accounted for by initial
differences on other variables. They lacked a procedure for handling
the pretests simultaneously.

Bond and Dykstra were too hasty in dismissing ATI in the other
projects simply on the basis of nonsignificance. According to Cronbach
and Snow, blocking on aptitude produces extremely weak significance
tests even when students are properly the unit of analysis and N is large.

Bond and Dykstra attributed to chance several borderline significant
interactions that involved the same outcome variable. But when significance
tests lack power, it is a mistake to let descriptions of nonsignificant
but interesting effects drop from sight.

Bond and Dykstra recognized the virtues of taking the class as 
unit of analysis, saying flatly that the class mean is the correct unit 
for their analyses of treatment effects (not distinguishing this from the 
half-class). Like other investigators of the period, they overlooked the 
concept of class-level ATI. Because they saw analysis at the class level as a 
controversial procedure, they compared estimates of treatment effects 
from their half-class-level analysis within projects and their pupil-level 
analysis. They pointed out that differences in procedure (especially 
covarying out Pintner in the individual analysis and blocking on it in
the half-class analysis) obscure the comparison. The mean differences
as well as the significance levels were greater in the individual
analysis, by factors as large as 8 to 1. Higher significance levels
are to be expected, because of the increase in claimed d.f.; the increase in 
means, however, was not explained (see our Section 8). 
Procedures in our reanalysis. Professor Dykstra supplied us a
set of data for reanalysis, including only the pupils for whom
second- as well as first-grade data were available. Moreover, we
discarded classes where many data were missing or punched as zero.
A zero punch sometimes implies a missing score; even if that is not
the case, numerous zeros imply questionable test administration. Our
analyses in the first grade therefore cannot match the original report.

We used data from three projects that compared B and LE. The
numbers of students for whom we received data were 211, 189, and 181 
for B, and 171, 183, and 199 for LE. We dropped two LE classes with 
many zero scores from one project, reducing N for that project from 
199 to 169 students in 8 classes. Two B classes in that project 
exhibited many zero scores on pretests other than Pintner. We 
retained these classes in the analyses using the Pintner pretest, 
but dropped them in analyses of other pretests, lowering N from 181 
to 146 students in 8 classes. Likewise, in analyzing Pattern 
Copying and Identical Forms we set aside an LE class where zeroes 
were frequent. After cleaning we had from 8 to 11 classes within 
a treatment available for analysis. 
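
The screening rule described above can be sketched as follows. The report gives no numerical criterion for "many" zero scores, so the threshold `max_zero_fraction` is purely an assumption, as are the function name and the toy records.

```python
from collections import defaultdict

def classes_to_drop(records, max_zero_fraction=0.25):
    """Return ids of classes whose share of zero scores exceeds the threshold.

    records: iterable of (class_id, score) pairs for one pretest.
    max_zero_fraction: hypothetical cutoff; the source states none.
    """
    counts = defaultdict(lambda: [0, 0])  # class_id -> [n_zero, n_total]
    for class_id, score in records:
        counts[class_id][1] += 1
        if score == 0:
            counts[class_id][0] += 1
    return {cid for cid, (n_zero, n_total) in counts.items()
            if n_zero / n_total > max_zero_fraction}

# Toy records, invented for illustration.
records = [("A", 12), ("A", 0), ("A", 15), ("A", 9),
           ("B", 0), ("B", 0), ("B", 7), ("B", 0)]
dropped = classes_to_drop(records)
```

Applying the rule pretest by pretest, rather than once for all variables, corresponds to retaining a class for the Pintner analyses while dropping it for the others.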
We formed a composite outcome score (POST) from the Reading and
Paragraph Meaning subtests of the Stanford Achievement Test, weighting
each inversely with respect to its s.d. within treatments pooled. Here
we shall not discuss the remaining Stanford Achievement Test subtests.
Our conclusions about the contrasts between levels of analysis
of the composite are supported, however, by analyses on Spelling
and on Study Skills.
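
A minimal sketch of such a composite, assuming that "weighting each inversely with respect to its s.d. within treatments pooled" means dividing each subtest by its pooled within-treatment s.d. before summing. The function names and the six toy scores are hypothetical.

```python
import numpy as np

def pooled_within_sd(scores, groups):
    """Pooled within-group s.d.: summed within-group SS over summed d.f."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    ss, df = 0.0, 0
    for g in np.unique(groups):
        x = scores[groups == g]
        ss += ((x - x.mean()) ** 2).sum()
        df += len(x) - 1
    return np.sqrt(ss / df)

def composite(subtests, groups):
    """Sum the subtests after dividing each by its pooled within-group s.d."""
    return sum(np.asarray(s, dtype=float) / pooled_within_sd(s, groups)
               for s in subtests.values())

# Toy scores, invented for illustration.
treatment = np.array(["B", "B", "B", "LE", "LE", "LE"])
reading = np.array([10, 12, 14, 20, 22, 24])
paragraph_meaning = np.array([5, 7, 6, 9, 11, 10])

post = composite({"reading": reading,
                  "paragraph_meaning": paragraph_meaning}, treatment)
```

Scaling by the within-treatment s.d. keeps a treatment mean difference from inflating the weight given to either subtest.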

We used the following pretests in the reanalysis: Pintner,
Murphy-Durrell Letter Names and Learning Rate, Thurstone Pattern
Copying, Thurstone-Jeffrey Identical Forms, and Metropolitan Listening
Tests. We did not consider the Phonetics or Word Meaning subtests
because of the prevalence of missing data in our sample. For
demonstration purposes we take up one predictor at a time here, though a
multivariate analysis would be more adequate.

Pretest intercorrelations were low. Therefore we calculated only
univariate regressions. The composite outcome and all pretest variables
were standardized to mean zero and s.d. 100 over all cases pooled. POST
thus becomes ZPOST, Pintner becomes ZPINT, etc.
Results of our analysis. First, we obtained a conventional regression
coefficient (cases pooled) within each treatment within each project,
across all projects within a treatment, and across all treatments within
a project. These are more or less conventional analyses. Second, we
have a between-classes regression coefficient within each treatment
from the analysis of class means within a project. Third, we have
a set of pooled within-classes regression coefficients, each calculated as the
mean of specific within-class slopes, for the combinations of treatment
and project listed above. Fourth, we have the regression within each class.
In order to simplify, the tables to follow report data for only three
pretests, but conclusions are generally drawn from six such tests.
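
The four kinds of coefficient just listed can be computed as in the sketch below. It is an illustration, not the program used for the reanalysis: the function names and the toy data are hypothetical, the between-classes slope is weighted by class size as stated for Table 5.2, and the pooled within-class value is taken as the unweighted mean of the specific slopes (an assumption about "the mean of").

```python
import numpy as np

def slope(x, y, w=None):
    """Least-squares slope of y on x, optionally weighted."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    xm = np.average(x, weights=w)
    ym = np.average(y, weights=w)
    return np.sum(w * (x - xm) * (y - ym)) / np.sum(w * (x - xm) ** 2)

def decompose(x, y, class_id):
    """Conventional, between-class, pooled within-class, and specific slopes."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    class_id = np.asarray(class_id)
    ids = np.unique(class_id)
    sizes = np.array([(class_id == c).sum() for c in ids])
    x_means = np.array([x[class_id == c].mean() for c in ids])
    y_means = np.array([y[class_id == c].mean() for c in ids])
    specific = {c: slope(x[class_id == c], y[class_id == c]) for c in ids}
    return {
        "conventional": slope(x, y),                      # individuals pooled
        "between": slope(x_means, y_means, w=sizes),      # class means, size-weighted
        "pooled_within": float(np.mean(list(specific.values()))),
        "specific": specific,                             # one slope per class
    }

# Toy data: within every class the slope is 0.5; across class means it is 2.0.
class_id = np.repeat([0, 1, 2], 3)
class_mean_x = np.repeat([0.0, 10.0, 20.0], 3)
offset = np.tile([-1.0, 0.0, 1.0], 3)
x = class_mean_x + offset
y = 2.0 * class_mean_x + 0.5 * offset
result = decompose(x, y, class_id)
```

In the toy data the conventional slope falls between the within-class and between-class values and lies close to the latter, because most of the predictor variance is between classes.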

(1) Conventional regressions. The conventional regression
coefficients of the standardized composite outcome variable (ZPOST) onto
the standardized readiness measures appear in Table 5.1.

The slopes of ZPOST onto the standardized scores for ZPINT, ZIDEN
(Identical Forms), and ZLIST (Listening) were higher in
LE than in B for all cases pooled within a treatment. These differences
generally reappeared within projects. Differences of around 0.25 are neither
dramatic nor trivial. Taking that figure at face value, a student 2 s.d.
below the mean on ZPINT will rise 1/2 s.d. in posttest
performance if he moves from LE to B.
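
The arithmetic behind the 1/2 s.d. figure can be written out. The sketch assumes the two regression lines cross at the pretest mean (equal intercepts), which is what taking the slope difference "at face value" amounts to here; the function name is hypothetical, and the slopes are the projects-pooled ZPINT values from Table 5.1.

```python
def outcome_in_b_minus_le(z, slope_b, slope_le, intercept_b=0.0, intercept_le=0.0):
    """Predicted outcome in B minus predicted outcome in LE at pretest score z.

    All quantities are in s.d. units. Equal intercepts (the default) are an
    assumption: the treatments are taken to give equal outcomes at the mean.
    """
    return (intercept_b - intercept_le) + (slope_b - slope_le) * z

# Slopes 0.31 (B) and 0.56 (LE); student 2 s.d. below the mean on ZPINT.
gain = outcome_in_b_minus_le(z=-2.0, slope_b=0.31, slope_le=0.56)
```

A nonzero difference between the treatments at the mean would shift this figure by the same amount for every student.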

In the regressions of ZPOST onto ZIDEN, the only large effect
appeared in the Stauffer project, where the slope in B was close to
zero. In the Stauffer project there were rather
large slope differences of the same kind on ZLIST, ZLET, ZLRN, ZCOPY,
and ZIDEN. In the other two projects slope differences were usually
negligible.

We move on now to decompose the effects.



Table 5.1. Conventional unstandardized regression coefficients of ZPOST
for individuals pooled

                                       Project
Predictor  Treatment      Cleland     Hahn     Stauffer   Projects pooled

ZPINT      B                .09        .43        .30          .31
             (s.d.)       (88.3)     (93.1)     (94.8)
           LE               .30        .73        .62          .56
             (s.d.)       (84.4)     (82.4)    (134.1)
           Pooled           .17        .57        .52          .45
           Difference       .21        .30        .32          .25

ZIDEN      B                .17        .27        .09          .20
             (s.d.)      (101.?)     (89.4)    (118.4)
           LE               .29        .30        .87          .42
             (s.d.)       (69.9)    (108.9)     (76.5)
           Pooled           .16        .28        .32          .26
           Difference       .12        .03        .78          .22

ZLIST      B                .06        .24        .14          .22
             (s.d.)       (99.2)     (89.1)    (106.7)
           LE               .15        .34        .44          .39
             (s.d.)       (78.3)     (93.0)    (116.7)
           Pooled           .10        .29        .33          .31
           Difference       .09        .10        .30          .17

(2) Between-classes regressions. Between-classes statistics weighted
by class size for three variables appear in Table 5.2. The slope differences
(LE - B) for all six pretests may be summarized as follows:

                        -0.30 to -0.01   0.00 to 0.29   0.30 to 0.59   0.
Projects pooled (cf. Table 5.1)    3    2
Cleland     2    1
Hahn        2    2    1
Stauffer    1

Many differences that seem practically important appear, all the large
differences indicating a steeper slope in LE. That is, in LE the abler
classes do conspicuously better than classes of low average readiness.
Between-classes analysis, which to us as to Bond-Dykstra seems to be the
appropriate emphasis in this research, paints a far more emphatic picture
of ATI than the conventional analysis. The variation from project to
project is noteworthy. The Stauffer classes were, as a set, far below the
others on most of the pretests, which may or may not be a causal factor
in generating slope differences.

Each of the slopes is determined by 11 or fewer classes, and
consequently we can have no confidence that similar results would appear in
new samples of classes. As in the Anderson reanalysis, plotting data
points is instructive. In the Cleland project, the slope onto ZPINT was
negative, suggesting that B was detrimental to high-ability classes. The
plot of B class means for that project (Figure 5.4), however, showed that
the negative slope resulted from the deviation of just one class (near
-100, +100). If that one class were deleted, the slope in B would be
positive, not negative.
Table 5.2. Between-classes unstandardized regression coefficients of ZPOST

                                       Project
Predictor  Treatment      Cleland     Hahn     Stauffer   Projects pooled

ZPINT      B               -0.21       .45        .58          .28
             (s.d.)       (55.7)     (34.5)     (25.6)
           LE               .43       1.08        .97          .70
             (s.d.)       (44.5)     (29.8)
           Pooled          -0.01      0.85       0.98          .56
           Difference       .64        .63        .39          .42

ZIDEN      B                .15      -0.07      -0.12          .29
             (s.d.)       (70.3)     (50.0)     (43.8)
           LE               .79        .16       2.70          .68
             (s.d.)       (28.4)     (68.4)
           Pooled           .14        .06        .50          .32
           Difference       .64        .23       2.82          .39

ZLIST      B               -0.08     -0.07      -0.00          .36
             (s.d.)       (45.7)     (40.5)     (26.1)
           LE              1.17        .40       1.91         1.25
             (s.d.)       (21.1)     (33.1)     (36.5)
           Pooled           .12        .13       1.32          .69
           Difference      1.25      -0.47       1.91          .89

Figure 5.4. Plot of class means in Reading study

The slope differences onto ZPINT in the Hahn project may also
be a chance result. The B and LE classes could fit into the same 
joint distribution. The range on ZPINT is small and the slopes not 
well determined. 

The LE classes in the project directed by Stauffer formed an 
unusual distribution on ZPINT, split in the middle; some classes were 
very much lower at pretest and posttest than the majority of classes
in that project. In the range where ZPINT > -50, the 8 Stauffer B 
classes have conspicuously poorer outcomes than his 6 LE classes in 
that range. This is an effect worth noting, whether significant 
or not. There is no warrant for a statement about the regression 
slopes, in view of the narrow range of these 6 LE classes. 

The result for ZIDEN agrees with Lo's finding of an interaction of 
treatment with "perceptual speed". On each subtest — even ZIDEN — 
however, LE and B plots fit into the same joint distribution. Thus, within 
projects, every negative slope of ZPOST onto ZIDEN or ZLIST resulted from 
a single class with high posttest and low pretest or from a class with 
high pretest and low posttest (or from one class of each kind). Again, 
we must dismiss the differences between slopes among B and LE classes 
as chance results. 
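
The dependence of a between-classes slope on a single class can be checked by refitting with each class left out in turn. The sketch below uses invented class means with one class placed near (-100, +100), echoing the Cleland plot; it is unweighted for simplicity, whereas the tabled slopes were weighted by class size, and the function names are hypothetical.

```python
import numpy as np

def slope(x, y):
    """Unweighted least-squares slope of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

def leave_one_out_slopes(x, y):
    """Slope recomputed with each unit (class) omitted in turn."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~np.eye(len(x), dtype=bool)  # row i drops unit i
    return np.array([slope(x[k], y[k]) for k in keep])

# Invented class means: seven classes on a line of slope 0.5, plus one
# class with low pretest and high posttest.
pre = np.array([-60, -40, -20, 0, 20, 40, 60, -100])
post = np.array([-30, -20, -10, 0, 10, 20, 30, 100])

full_slope = slope(pre, post)
loo = leave_one_out_slopes(pre, post)
```

In this toy set the full slope is negative, yet omitting the one deviant class restores a slope of 0.5; a sign that flips on the removal of one class is the pattern described above.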

At the between-classes level no Aptitude x Treatment interaction has 
been established. It is entirely likely that differences in regression 
slopes are chance results. Studies with many classes per treatment are 
needed to estimate between-classes effects. 






(3) Within-class regressions. The within-class regression coefficients,
averaged within each project, appear in Table 5.3. These varied much less than
the between-classes slopes, and the differences were small. Just one of
the 18 differences exceeded 0.30 (Hahn on ZLIST). It is evident that
the interaction effects found in the overall analysis arose from the
between-groups differences (which we recognize as likely to be chance
effects) and not from within-group differences. If the between-groups
effects are untrustworthy, it follows that the differences observed in
the overall analysis are untrustworthy.

Taking all projects together, the difference for ZPINT is in the
direction of a finding reported by Bond-Dykstra and Lo from their
conventional analyses, but the effect is very weak (0.46 vs. 0.34). On the
other pretests the differences for projects pooled are even weaker. Within
projects separately, there is not even a consistency in sign between
the slope differences in Table 5.2 and those in Table 5.3.

The two comparatively large differences (both in the Hahn project) 
cannot be taken seriously. 

Regression slopes for ZPOST onto ZPINT within Hahn's 11 classes ranged
from 0.23 to 0.80 in B and from 0.46 to 1.00 in LE.
The slopes of ZPOST onto ZLIST ranged from -0.70 to 0.81 in B and from
0.08 to 0.47 in LE. This undermines the conclusion that slopes
tend to be higher in LE than in B.

We looked further into instances where a specific within-class
coefficient on a pretest was negative. All these negative values were
traceable to one or more anomalous students who scored high at pretest
and very low at posttest or vice versa.

Table 5.3. Unstandardized regression coefficients of ZPOST
for individuals within classes

                                       Project
Predictor  Treatment      Cleland     Hahn     Stauffer   Projects pooled

ZPINT      B                .24        .42        .36          .34
             (s.d.)       (68.5)     (86.5)     (91.2)
           LE               .23        .70        .43          .46
             (s.d.)                    (?)     (102.8)
           Pooled           .23        .56        .39          .40
           Difference      -.01        .28        .07          .12

ZIDEN      B                .16        .43        .36          .31
             (s.d.)       (72.5)     (74.1)    (110.0)
           LE               .20        .42        .25          .28
             (s.d.)                             (68.7)
           Pooled           .18        .42        .27          .30
           Difference       .04      -0.01      -0.11        -0.03

ZLIST      B                .12        .31        .24          .22
             (s.d.)       (88.0)     (79.4)    (103.5)
           LE               .10        .72        .15          .33
             (s.d.)       (75.4)     (86.9)    (110.8)
           Pooled           .11        .51        .19          .28
           Difference     -0.02        .41      -0.09          .11

Conclusions regarding units of analysis. The reanalyses we have made
of the Bond-Dykstra data (of which only a fraction appear in this report)
demonstrate the central themes of our theoretical section.

1. Analyses of the conventional kind, pooling individuals across
classes, combine between-class and within-class effects in the sample.
They therefore give an equivocal descriptive picture of the relation of
outcomes to predictors. We shall later see (Section 8) that they give a
poor description of adjusted treatment effects. Significance tests based
on individual-level analysis are unacceptable when classes are the unit of
sampling. Because between-class data weigh heavily in the overall regression
slopes, any undependability in the between-class results casts doubt on the
overall results.

2. Between-class analyses appear appropriate in this study.
Between-class regression slopes often differ greatly between treatments.
These differences, however impressive they may be when coefficients are
compared, are evidently dependent on the inclusion of particular "outlier"
classes in the sample. With samples of 11 or fewer classes per treatment,
observed differences in between-class coefficients are untrustworthy.
The alternative of pooling projects for a between-classes
analysis leaves us with modest but consistent differences in coefficients,
based on the unusually large sample of about 30 classes per treatment. Whether
it is legitimate to combine projects, however, is questionable.

3. Pooled-within-class within-project coefficients do not differ
greatly. Even though these coefficients are based on large numbers of
observations, their statistical stability is low, because the specific
within-class coefficients differ considerably. Such variability may be
an important subject for investigation.
Head Start Planned Variation

Our third set of reanalyses exploits a fraction of the data
collected in the Head Start Planned Variation study. This study was
carried out in 1969-71, in the wake of the Westinghouse study of Head
Start. Like the Follow-Through study that Abt analyzed, this was a
prospective study in which a number of
sponsors set up experimental classrooms using their own "models" of 
instruction; the control groups (chosen by the sponsor) were enrolled 
in "regular Head Start classrooms". Emphasis, however, was to be 
placed on the contrast among experimental groups. The samples given the 
various treatments were not chosen to be similar at the outset. 

My interest in these data was aroused by an ATI study made by the
Huron Institute (Featherstone, 1973) under the direction of Marshall
Smith. A number of interactions of treatment differences
with such variables as the Pre-School Inventory of Caldwell (PSI) and
prior preschool experience (PPE) were reported. Featherstone analyzed
data from two cohorts. The 1969-1970 data were used to identify
hypotheses for more formal testing on the sample of the next
year. The first analyses are said to have been made by "the Data-Text
packaged program for unweighted-means analysis of covariance." Huron
argued that the variation in sample size from model to model was
fortuitous, leaving no reason for giving greater weight to models which
had more children. Although this reasoning appeals to me (models being
considered fixed), classes are random within models and should be weighted
by size within models. It appears that the child was taken as the unit of
analysis in both the first and second set of data, and I do not know how
the computer package resolved the weighting problem.

For analysis of the 1970-71 data, Smith set out a most unusual set
of procedures (summarized in Featherstone's appendix). The description is
too limited to remove ambiguity; just what the Huron group did is not
greatly important here, however, as I am not retracing their footsteps.
Some comment does seem to be called for. Let me consider their "PSI
regression 4b" (Featherstone, p. 188). The dependent variable was the
PSI posttest (PSI-2) and the independent variable was "directiveness of
model". For this purpose, all Engelmann-Becker cases and all Bushell
cases were coded as more directive; and EDC, Bank Street, and Far West
cases were coded as less directive. Only 183 cases were employed.
Featherstone speaks of 12 first-order predictors (one being model-group
and the others being descriptive of the child and his background).
Class identification was ignored. The full model also contained
32 first-order interaction terms (11 of them being relations of the
form Model x Child characteristic), and at least four second-order
interactions but possibly a much larger number. We are told that
"regressions were done stepwise with main effects forced in and
interactions allowed to enter one by one to explain the maximum additional
variance. Results given in the text are for the step on which the
standard deviation of the residuals was minimum."

The "results" take the form, first, of the standardized regression
coefficient and its significance for each of three variables: Model
group, PSI (pretest) and their product. The latter two were significant.
Second, there is a table of "Effects on adjusted PSI [2] score (given
in s.d.'s)" at p. 111: a fourfold table, crossing directiveness with
High/Low PSI-1. The High/Low contrast gives means 0.5/-0.8 in the
less directive treatment and 0.4/-0.1 in the directive treatment; the
student of ATI effects could easily believe that the choice of treatment
for Lows makes a large difference.

Attempts by various persons at Huron to provide me with a more
complete description of the analytic procedure broke down, and I can
only try to invent a plausible way to get such figures. Perhaps Huron
tested significance of the three contributions independently, by the
step-down method of removing each one from the full model. Possibly
the adjusted scores were deviations from estimates obtained for
individuals using the full-model regression equation less the critical
terms for PSI-1, Directiveness, and PSI x Directiveness. A procedure
even approximately like this would be enormously daring, since it seems
to abandon entirely the customary assumption of homogeneous regressions
across treatments, and fits dozens of regression coefficients to obtain an
adjustment. The final variable is not PSI-1; it is PSI-1 with
dozens of things partialled out. Such steps could be given
a strong justification, provided that (1) the variables on which
treatment groups differed at pretest are highly reliable; (2) the product
terms were formed by multiplying deviations from the grand mean (anything
else allows correlations among predictors to totally obscure what is
happening); and (3) children had been sampled and treated independently.
I suspect that all these requirements were violated, but it is the third that
brings me back to the point of this report.
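
Requirement (2) can be illustrated with synthetic data: a product term built from raw scores is almost a copy of the treatment code, whereas one built from deviations about the grand means is nearly uncorrelated with it. The variable names and the simulated values below are hypothetical, chosen only to resemble a pretest with mean 40 and s.d. 10.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
pretest = rng.normal(40.0, 10.0, size=n)             # synthetic raw pretest
treatment = rng.integers(0, 2, size=n).astype(float)  # 0/1 treatment code

# Product term from raw scores versus from deviations about grand means.
raw_product = pretest * treatment
centered_product = (pretest - pretest.mean()) * (treatment - treatment.mean())

r_raw = np.corrcoef(raw_product, treatment)[0, 1]
r_centered = np.corrcoef(centered_product, treatment)[0, 1]
```

With the raw product, the "interaction" predictor and the treatment main effect are nearly collinear, so their partial regression weights cannot be separated with any stability.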
When there are 183 cases and dozens of correlated predictors,
any one of the partial regression coefficients is likely to be highly
unstable. This is true even when no one intercorrelation of predictors
is large. What is worse, in this study the students were treated in
groups. It seems that data come from some 15 classes. (Children for whom
IQs were missing were left out of Featherstone's Regression 4.) The
number of classes is much less than the number of variables entering
the regression equation. If classes are the sampling unit, there are left
no degrees of freedom for making estimates of effects. The data
have been seriously "overfitted"; it is not unlikely that the final
regression weights in the full model were fitted to rounding error.
Analysis at the individual level can be defended, I think, only by
asserting that each child received independent treatment, and that
differences in pretest characteristics and treatment delivery, among
children sampled one from each class, were no larger than would be found
for a random sample of children within a class.
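
The d.f. argument can be made concrete by counting. The sketch assumes the full model with 12 first-order predictors, 32 first-order interactions, and the minimum of four second-order interactions; the step actually reported may have contained fewer terms, and the function name is hypothetical.

```python
def residual_df(n_units, n_terms):
    """Residual d.f. for a regression with an intercept, floored at zero."""
    return max(n_units - n_terms - 1, 0)

# Term count for the full model as described (second-order count is a minimum).
n_terms = 12 + 32 + 4

df_children_as_units = residual_df(n_units=183, n_terms=n_terms)
df_classes_as_units = residual_df(n_units=15, n_terms=n_terms)
```

Counting children as independent units appears to leave ample d.f., but with some 15 classes as the sampling unit nothing remains for estimating error.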

We made simple regression analyses, one the conventional overall
analysis such as Featherstone employed, and two with partitioned effects.
(The analysis we made and the ancova to appear in Section 8 are more nearly
like Featherstone's "regression 3" than the analysis just reviewed.
The reason for reviewing analysis 4b is that Featherstone's description
of it is less equivocal than that for 3; moreover, the only summary data
reported on her PSI studies came from 4b.)

In a file of data supplied by Tony Bryk of Huron, we selected 
a set of 244 chi Idren in 18 classes of the more directive programs 
(Bushell, Engelmann-Becker) and 315 in 30 classes of the less directive 
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programs (EDC, Bank Street, Far West), to investigate the regression 
of the PSI posttest (PSI-2) on PSI pretest (PSI-1). Featherstone used 
422 children in her regression. 

As in the Anderson and Bond reanalyses, we have regression slopes
within treatments for three analyses. The raw-score means of PSI-1
were 38.6 and 35.1 for the directive and nondirective groups (hereafter
D and ND, respectively). The means of PSI-2 were 49.89 and 44.78.
The s.d.'s were in the range 9-13. We converted all variables to a
metric with 100 as the s.d. for all cases together.

The three analyses all indicate a steeper slope in the ND
treatment (Table 5.4). In the Featherstone report also, the D treatment
appeared to be advantageous for children low on the PSI pretest and
not for Highs. The slope difference in the conventional analysis
(our counterpart of Featherstone's) is considerably smaller than the
difference in the between-groups coefficients, however. Taking the
coefficients at face value, the between-groups value seems to imply
that what happened in the ND treatment depended strongly on the ability
level of the class; this was much less true in D. Within classes
the interaction is considerably smaller than in the between-classes
analysis or the conventional analysis. A reader of Featherstone's
report would be led to think that the ND treatment is more profitable
for individuals with higher initial PSI, but the within-classes effect
is evidently slight. The effect she reported operates mostly between
classes. If this phenomenon were established as stable, it would argue
that ND is an advantageous treatment for the child
placed in a group with high average PSI-1; this would be true whether
he himself is high or not.
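
The three kinds of coefficient contrasted here can all be computed from one file of cases. What follows is a minimal sketch, not the computation used for Table 5.4: the function name `three_slopes`, the record layout, and the toy numbers are assumptions made for illustration. The between-classes slope weights each class mean by class size, which is one common convention; under it the conventional slope is a weighted average of the other two, with the between-classes share of the X sum of squares as the weight.

```python
from collections import defaultdict


def three_slopes(records):
    """Return (conventional, between_classes, pooled_within) slopes of y on x.

    records: list of (class_id, x, y) tuples, one per child (hypothetical layout).
    """
    n = len(records)
    grand_x = sum(x for _, x, _ in records) / n
    grand_y = sum(y for _, _, y in records) / n

    by_class = defaultdict(list)
    for class_id, x, y in records:
        by_class[class_id].append((x, y))

    sxy_between = sxx_between = 0.0
    sxy_within = sxx_within = 0.0
    for members in by_class.values():
        m = len(members)
        mean_x = sum(x for x, _ in members) / m
        mean_y = sum(y for _, y in members) / m
        # Between-classes sums: class means about the grand mean, weighted by size.
        sxy_between += m * (mean_x - grand_x) * (mean_y - grand_y)
        sxx_between += m * (mean_x - grand_x) ** 2
        # Pooled-within sums: each child about his own class mean.
        for x, y in members:
            sxy_within += (x - mean_x) * (y - mean_y)
            sxx_within += (x - mean_x) ** 2

    # The conventional (overall) slope ignores class membership entirely.
    conventional = (sxy_between + sxy_within) / (sxx_between + sxx_within)
    return conventional, sxy_between / sxx_between, sxy_within / sxx_within


# Toy data, invented for illustration: two classes of two children each.
toy = [("A", 1, 1), ("A", 3, 2), ("B", 5, 7), ("B", 7, 8)]
conventional, between, within = three_slopes(toy)
print(conventional, between, within)
```

In the toy data the classes differ sharply in mean level, so the conventional slope lies much closer to the between-classes slope than to the pooled-within slope.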

Table 5.4. Regression coefficients (PSI-2 on PSI-1) within
Head Start treatments calculated by various procedures

                                  Regression coefficients
                        --------------------------------------------
                                         Between     Children pooled
Treatment        η²     Conventional     classes     within classes

Directive        0.34      0.617          0.621          0.615
Non-directive    0.38      0.869          1.083          0.737



Table 5.5. Regression coefficients of PSI-2 on various predictors

                                     Regression coefficients
                               ------------------------------------------
Predictor          Treatment   Conventional    Between    Pooled within

Age                   D            0.33          0.39         0.14
                      ND           0.52          0.63         0.20
Prior preschool       D            0.26          0.94         0.14
                      ND           0.07          0.22        -0.02
White (v. Black)      D            0.14         -0.20         0.27
                      ND           0.39          0.57         0.19
MOMED                 D            0.21          0.11         0.24
                      ND           0.08         -0.37         0.21

The truth of the matter once again is found in a plot of class
means. The statistics and the plot are given in raw-score units.
The statistics did not suggest a dramatic disparity. On
PSI-1, there was a mean and s.d. of 38.6 and 7.0 for D, compared with
35.1 and 7.9 for ND. On PSI-2 the values were 49.9, 4.7; 44.8, 7.8.
Any drama has to be found in the s.d.'s for PSI-2. It turns out that
a small army of ND classes had means below 30 on PSI-1, whereas only
three D classes were so low. The unimpressive one-point difference in pretest
s.d.'s represents variation impressive to the eye in the chart.
Figure 5.5 repeats the story of earlier plots in this chapter: the two
sets of points are close to indistinguishable, so far as trend is
concerned [save for one lone outlier]. It would be imprudent to assert
that there is nothing to the view that the D treatment is comparatively
likely to produce changes in class rankings. But the evidence for an
interaction is much less impressive than a total of 559 cases and an
overall slope difference of 0.25 led us to expect.

Analyses of other variables give further examples of contrast
among regression coefficients of the three types. In Table 5.5 are
exhibited relations of PSI-2 to various predictors, again with
100 as the overall s.d. for all variables.
Featherstone (p. 136) reported that the directive models favored younger
children in these data. She gave no numerical results to support the
statement. Table 5.5 shows a weak tendency toward a flatter slope in D,
which is consistent with her statement, but not impressively so. Again,
the difference arises mostly from the sketchily determined between-classes slopes.

Preschool experience did not have a statistically significant effect
on PSI-2, Featherstone said, but there was a strong effect on posttest IQ.
The estimated regression slope of IQ on preschool experience (after
her complex adjustment) was flat in the less directive treatments, and
quite steep in D. Since PSI and IQ were strongly correlated, the
relations for the two ought to be similar. When we regressed PSI-2
onto preschool experience (with no adjustment) the regression slope
was flat with ND in the overall analysis; the slope with D was only
modestly positive. The correlation of Age with PSI-1 between classes
(0.70) was about as strong as that with PSI-2 (0.66). Very likely
Featherstone's complex analysis did not succeed in "correcting" for
initial differences in ability. Perhaps her finding arose chiefly
from the between-groups slope of ability on age at pretest. Adjusting
by means of the shallow conventional slope would by no means remove
the large between-classes trend. (See Section 8 on adjustments.)

Featherstone's report on interactions of race is on a
within-projects basis. "Three of the less-directive models show
effects favoring white children, while ... [one of] the more directive
model[s] shows a highly significant PSI effect favoring Black children"
(p. 149). A project-by-project analysis has advantages
over a classes-within-treatment analysis. The N's
within projects are often too small to allow solid analyses, however. For what
it is worth, the breakdown in Table 5.5 shows that if there is a
difference between D and ND it is found in the between-classes regression
slopes, with the exceptional (but weak and not-to-be-trusted) finding
that D classes with more black members earn higher scores on the average
than D classes with fewer black members. Again here, the paradoxical
reversal of sign was present at pretest.

MOMED was one of three SES indicators in the Huron analysis,
and no clear results emerged. The regression slopes in the table show
no strong effect and, as usual, the largest difference is found in
the weakly measured between-class coefficients. It is hard to credit
a finding that classes where the children have, on average, more educated
mothers should do less well, as they do in ND. It will come as no surprise
that almost the same difference in between-classes coefficients was present
at pretest.

Like the Bond-Dykstra study, this is a comparatively large experiment, 
planned with substantial national resources and subjected to thoughtful 
attention by both substantive specialists and methodologists over a 
period of years. Despite the ambitious plan, the study is manifestly 
too small to permit convincing comparison of the "planned variations", 
with 500-600 children distributed over eight models and each model 
represented by fewer than a dozen classes, ill-matched across studies. 
The Huron analysts had some justification for collapsing so as to 
contrast the D and ND types of model. They mistakenly thought that
with 200-300 cases per treatment they could perform an elaborate search
for interactions. In fact, they had 18 cases for most analyses in the
D treatment, since a class constitutes a case, or so this report argues.
As has been seen in other studies, the interaction effects reported arise
primarily at the between-classes level. Between-classes effects with a
limited number of classes are to be viewed with suspicion. Moreover, the
effects reported by Featherstone seem with suspicious frequency to be an
echo of between-class trends present before training began. The idiosyncrasy
of the Huron analysis should not obscure the fact that adjustment
of posttest scores for initial differences, on the basis of the overall
regression coefficient, cannot adjust adequately for between-class
differences. I shall return to this topic.

6. Disattenuating regression slopes

In theoretically-oriented work, relations of true or universe
scores are the chief concern. In practically-oriented work also, the observed
relations ought almost always to be corrected for error of measurement.

In the classical formulation of the problem, ξ_p is the
true score of Person p on the test whose observed score is X_p, and
η_p is the true score underlying Y_p. Then β_ηξ = β_YX / ρ²_Xξ, the denominator
being the reliability coefficient for X. The error in Y does not
enter into the correction. Since ρ²_Xξ ≤ 1, the corrected slope is steeper
than the uncorrected slope.
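
The classical correction is a single division, as the brief sketch below shows. The function name `disattenuate_slope` and all numbers are hypothetical, chosen only to illustrate that two equal observed slopes part company once each is divided by a different reliability for X.

```python
def disattenuate_slope(observed_slope, reliability_x):
    """Divide an observed Y-on-X slope by the reliability coefficient of X.

    Error in Y does not enter the correction; only the reliability of the
    predictor matters. Both arguments here are illustrative inputs.
    """
    if not 0.0 < reliability_x <= 1.0:
        raise ValueError("reliability must lie in (0, 1]")
    return observed_slope / reliability_x


# Hypothetical case: identical observed slopes within and between groups,
# but different reliabilities for X at the two levels.
within_corrected = disattenuate_slope(0.60, 0.75)
between_corrected = disattenuate_slope(0.60, 0.96)
print(within_corrected, between_corrected)
```

Because the reliability cannot exceed 1, the corrected slope is never flatter than the observed one.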

There is no reason to expect the pooled-within-groups and between-groups
reliabilities for X to be the same.

The two reliabilities for X will be the same (in the population)
if groups are formed at random. Some writers appear to expect between-groups
reliabilities to be higher just because means are determined more accurately
than individual scores. An example appears in the Abt report (p. V-6);
it expresses concern for the biases arising from measurement error in
individual-level ancova and then adds: "Measurement error, on the other
hand, need not concern us at the school level: the stability of school
means is much better than that of the individual child measurements."
I shall argue that the standard error of measurement for means is low but
that the reliability coefficient need not be higher between than within groups.

It is to be noted that if the within-groups and between-groups
regression coefficients are the same, they will not be the same after correction
for attenuation if the reliability coefficients differ. Conversely,
regression coefficients that differ may become equal or may differ in the
opposite direction, if the reliabilities differ. This is one more barrier
to inferring context effects from a difference in observed regression slopes.

Within- and between-groups reliabilities



Three cases ought to be distinguished:

  I. Classes are formed without reference to the scores X.
     Individuals are tested on X before classes are formed,
     or persons within a class are tested independently.
 II. Class membership is determined in whole or in part on
     the basis of X. Individuals are tested on X before
     classes are formed.
III. Classes are formed without reference to the scores X.
     X measures are taken within the classes, by a group
     testing procedure.

In any of these cases a variable correlated with X but observed independently
of it may influence assignment to classes; i.e., η²(μ), the intraclass
correlation for universe scores, need not be zero in
any case. The cases have not been considered separately by previous writers.
They require different psychometric analyses.

Let us assume that groups are of uniform size n_c. Also let us assume
that all members of a group are tested under the same n_i conditions of
facet i (e.g., test items) and the same n_j conditions of j (e.g., occasions).

The terminology of this section and the basic concepts come from Cronbach
et al., 1972 (hereafter referred to as CGNR). Generalizability theory departs from
classical theory in recognizing several sources of error and in not requiring
homogeneity of means, variances, and correlations of scores obtained under
different conditions. Each person has a universe score μ_p; that is the expected
value of his score X_pij when X_pij is observed on all i,j combinations in the
universe. I assume that the same universe is pertinent to every group. I
write μ_c for the mean of the universe scores for the group members. Here
I write X_p and X_c for the observed scores of individual and group.
We wish to consider at various points parameters of the overall, between-groups,
and within-groups distributions. The respective variances, for
example, will be identified as σ²_o(X_p), σ²_b(X_c), and σ²_w(X_p) [= σ²(X_p - X_c)].

Case I. When measurement is independent of group membership and
vice versa, the generalizability (reliability) of individual-level scores is
evaluated without regard to groups, in the manner set forth in CGNR. We can, in
the crossed design assumed above, express the observed score as the sum of the
universe score, an error component, and a constant:

     X_p = μ_p + δ_p + constant.

Estimates of σ²(μ_p) and Eσ²(δ) are available
for whatever design was used to collect the X scores, and these together
provide the coefficient of generalizability Eρ². It is necessary to speak of
expected values of certain variances and coefficients because the CGNR model does not
assume uniformity of error variances; the conditions i and j drawn for a particular
realization of the measuring operation produce a certain population variance
for δ or X, and it is the expectation over i and j that is of interest.

For purposes of disattenuating a between-groups regression coefficient,
however, one wants a group-level coefficient of generalizability. This is
the ratio of σ²(μ_c) to Eσ²_b(X_c). The basis for forming groups determines
an intraclass correlation η²(μ_p, μ_c), or simply η²(μ). That is the
ratio of between-groups variance in μ to the total variance. Since, when p is
a member of c, μ_p = μ_c + (μ_p - μ_c),

     σ²(μ_c) = η²(μ) σ²(μ_p)   and   σ²(μ_p - μ_c) = [1 - η²(μ)] σ²(μ_p).

Also, X_c = μ_c + δ_c + constant, where δ_c is the average of δ_p over members
of the group. Since classes are formed without
regard to δ, Eσ²(δ_c) = Eσ²(X_c - μ_c) = (1/n_c) Eσ²(δ). Even
though persons are fixed, the errors of measurement are random. E refers
to the expectation over repeated applications of the same design. The
between-groups coefficient of generalizability is

     r_b = Eρ²(μ_c, X_c) = η²(μ) σ²(μ_p) / [η²(μ) σ²(μ_p) + (1/n_c) Eσ²(δ)].

This can be estimated from a G study on individuals which estimates
σ²(μ_p) and σ²(δ), and from the observed σ²_b(X_c).
The formulas for such a G study are given by CGNR.
An alternative is simply to carry out a G study on class means.

The within-groups coefficient r_w is

     r_w = Eρ²(μ_p - μ_c, X_p - X_c)
         = [1 - η²(μ)] σ²(μ_p) / {[1 - η²(μ)] σ²(μ_p) + [(n_c - 1)/n_c] Eσ²(δ)}.

This is a coefficient for classes pooled. Since η²(μ) ≥ 1/n_c, r_b ≥ r_w in
Case I.

If r_o indicates the overall coefficient,

     1 - r_b = (1 - r_o) / [n_c η²(X)].

η²(X) is less than η²(μ) unless r_o = 1. It is likely that
n_c η²(X) > 1, hence r_b > r_o.
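
The Case I relations can be checked numerically. The sketch below is an illustration under stated assumptions, not a procedure taken from CGNR: the function name `case1_coefficients` and the variance figures are hypothetical, and the formulas coded are the between-groups, within-groups, and overall coefficients as written in this section, with error variance (1/n_c)Eσ²(δ) between groups and [(n_c - 1)/n_c]Eσ²(δ) within.

```python
def case1_coefficients(var_universe, var_error, eta2_mu, n_c):
    """Return (r_b, r_w, r_o) for Case I.

    var_universe : variance of universe scores over persons
    var_error    : expected error variance for one person's score
    eta2_mu      : intraclass correlation of universe scores
    n_c          : common group size
    """
    between_true = eta2_mu * var_universe
    r_b = between_true / (between_true + var_error / n_c)

    within_true = (1.0 - eta2_mu) * var_universe
    r_w = within_true / (within_true + (n_c - 1) / n_c * var_error)

    r_o = var_universe / (var_universe + var_error)
    return r_b, r_w, r_o


# Hypothetical figures: universe-score variance 80, error variance 20, classes of 20.
random_groups = case1_coefficients(80.0, 20.0, eta2_mu=1 / 20, n_c=20)
sorted_groups = case1_coefficients(80.0, 20.0, eta2_mu=0.30, n_c=20)
print(random_groups)
print(sorted_groups)
```

With groups formed at random (intraclass correlation 1/n_c) the three coefficients coincide; as classes become more homogeneous in universe score, the between-groups coefficient rises above the overall coefficient and the within-groups coefficient falls below it.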



The between-groups coefficient derived here is the same as that suggested
by Shaycoft (1962), for which Haney (1974a) presents a derivation. Students
are treated as fixed within classes. Haney goes on to discuss an alternative
offered by Wiley (in Wittrock & Wiley, 1970) which treats
students as random within classes.
Wiley seems to contemplate that a group (the student body in one school, or
the class assigned to one teacher) could have many "parallel forms" drawn in
the same manner but not randomly representative of the total pool. His question
is how strongly class means would correlate from one such set of classes
to their set of Doppelgängers.

Case II. When assignment is based in part on the observed X, it is
necessary to consider not only η²(μ) but also an intraclass correlation
η²(δ_p, δ_c), or η²(δ). If assignment takes into account X and at least one
other variable correlated with μ, η²(μ) > η²(δ). If assignment takes into
account X plus variables uncorrelated with μ, η²(μ) = η²(δ). Now
Eσ²(δ_c) = η²(δ) Eσ²(δ). (In the limit as η²(δ) decreases, this degenerates
to (1/n_c) Eσ²(δ); i.e., to Case I.) With persons fixed within groups,

     r_w = Eρ²(μ_p - μ_c, X_p - X_c)
         = [1 - η²(μ)] σ²(μ_p) / {[1 - η²(μ)] σ²(μ_p) + [1 - η²(δ)] Eσ²(δ)}.

The between-groups coefficient for Case II (which has not been described in the
earlier literature) is smaller than that for Case I. The within-groups coefficient is larger.
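
The effect of an intraclass correlation among the errors on the within-groups coefficient can be seen in a small numerical sketch. It is an illustration only: the function name `within_coefficient` and the variance figures are hypothetical, and the only formula coded is the within-groups ratio [1 - η²(μ)]σ²(μ_p) / {[1 - η²(μ)]σ²(μ_p) + [1 - η²(δ)]Eσ²(δ)}. Setting η²(δ) to 1/n_c recovers the Case I within-groups coefficient.

```python
def within_coefficient(var_universe, var_error, eta2_mu, eta2_delta):
    """Within-groups coefficient when errors have intraclass correlation eta2_delta.

    The share [1 - eta2_delta] of the error variance remains within groups;
    the rest is carried by the group means.
    """
    within_true = (1.0 - eta2_mu) * var_universe
    within_error = (1.0 - eta2_delta) * var_error
    return within_true / (within_true + within_error)


n_c = 20  # hypothetical class size
case_1 = within_coefficient(80.0, 20.0, eta2_mu=0.30, eta2_delta=1 / n_c)
case_2 = within_coefficient(80.0, 20.0, eta2_mu=0.30, eta2_delta=0.30)
print(case_1, case_2)
```

When assignment depends on observed X, part of the error variance moves between groups, so less of it is left within groups and the within-groups coefficient rises.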

Case III. Analyses in Chapter VII of CGNR treat data
collected in groups, but do not consider simultaneously the generalizability
(reliability) of individual and group data. In Case III it
is necessary to analyze somewhat differently than in Cases I and II.

Here again, assume persons fixed within groups (p:c).
The universe consists of test forms i crossed with occasions j.
The investigator intends to generalize from D-study data generated by
applying the same form to all groups, each group being observed on one
occasion. Such a study has the design [(p x j):c] x i, n_i = 1, n_j = 1.
In the G-study, however, each of the k groups is to take more than one
form. Let us suppose that each group takes the same n_i forms, each form
on a different occasion. The design, then, is (p:c) x i; (j,ci). The
observed score for group c on any one i,j pair resolves into components
in this way:

     X_cij = μ + (μ_c - μ) + (μ_i - μ) + (μ_cij - μ_c - μ_i + μ) + e_cij.

Analysis of variance produces Table 6.1. Variance components are
estimated by entering the actual mean squares in place of the EMS. I
write σ²(c) for σ²(μ_c), etc., as in CGNR.
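
Entering the observed mean squares in place of the expected mean squares amounts to solving the three equations of Table 6.1: EMSres = σ²(res), EMSc = σ²(res) + n_i σ²(c), and EMSi = σ²(res) + k σ²(i). The sketch below does only that. The function name `variance_components` and the mean squares are hypothetical, and the sketch makes no attempt to handle the negative estimates that sampling error can produce.

```python
def variance_components(ms_c, ms_i, ms_res, k, n_i):
    """Solve the Table 6.1 expected-mean-square equations for the components.

    ms_c, ms_i, ms_res : observed mean squares for groups, forms, residual
    k                  : number of groups
    n_i                : number of forms taken by every group
    Returns (var_c, var_i, var_res).
    """
    var_res = ms_res                 # EMSres = var_res
    var_c = (ms_c - ms_res) / n_i    # EMSc   = var_res + n_i * var_c
    var_i = (ms_i - ms_res) / k      # EMSi   = var_res + k * var_i
    return var_c, var_i, var_res


# Hypothetical mean squares for 20 groups, each taking 2 forms.
var_c, var_i, var_res = variance_components(50.0, 30.0, 10.0, k=20, n_i=2)
print(var_c, var_i, var_res)
```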

The between-groups reliability coefficient r_b is
given by the ratio of σ²(c) to Eσ²(X_cij), where the latter is defined
by the D-study design. With the design specified (one form, one occasion),

     Eσ²(X_cij) = σ²(c) + σ²(ci) + σ²(res).

In view of the specification that persons are fixed within groups, and
in view of the intent to adjust a pooled-within-groups regression slope, it is
appropriate to decompose X_pij - X_cij. The components can be written

Vij " ^-ij = - ^) 

< c 



The last two components are confounded in the G-study design. 

The analysis of the deviation scores gives the quantities in Table 6.2.
(The analysis of variance could be carried out in one step for both
individuals and groups. It would also be possible to make the analysis
shown in Table 6.2 for one class at a time.)

Table 6.1. Mean squares in the analysis of group means and equations
for expected mean squares

Source of                                    Expected mean square as a
variance     d.f.                Mean sq     function of variance components

Groups c     k - 1               MSc         EMSc = σ²(res) + n_i σ²(c)
Forms i      n_i - 1             MSi         EMSi = σ²(res) + k σ²(i)
Residual     (k - 1)(n_i - 1)    MSres       EMSres = σ²(res)

Table 6.2. Mean squares in the analysis of individual deviation scores
and equations for expected mean squares

Source of                                        Expected mean square as a
variance        d.f.                  Mean sq    function of variance components

Person within
class p:c       k(n_c - 1)            MSp:c      EMSp:c = σ²(res:c) + n_i n_j σ²(p:c)
Residual        k(n_c - 1)(n_i - 1)   MSres:c    EMSres:c = σ²(p_c i, p_c j, p_c ij, e) = σ²(res:c)

These equations apply to the D-study:

     Eσ²(X_pij - X_cij) = σ²(p:c) + Eσ²(res:c)

     r_w = Eρ²(μ_p - μ_c, X_pij - X_cij) = σ²(p:c) / Eσ²(X_pij - X_cij)

Compare this with r_b. The numerators of r_b and r_w are σ²(c) and σ²(p:c).
As before, σ²(c) = η²(μ) σ²(μ_p) and σ²(p:c) = [1 - η²(μ)] σ²(μ_p).
Ruling out stratified random assignment, the minimum of η²(μ) is
that for groups formed at random.
Then η²(μ) ≥ 1/n_c and σ²(c) ≥ σ²(p:c)/(n_c - 1).

The denominators of r_b and r_w expand into

     σ²(c)   + σ²(ci)    + σ²(j) + σ²(cj)    + σ²(ij) + σ²(cij,e)   and
     σ²(p:c) + σ²(p_c i)         + σ²(p_c j)          + σ²(p_c ij,e).

Two terms in the upper row have no lower-row counterparts. Within paired
terms, the upper term and lower terms are in the ratio η²/(1 - η²), where
the η² is the value for that component. Whether
r_b and r_w are equal depends on the intraclass correlations.

Large occasion effects and large intraclass correlations for the pi, pj,
or (less likely) pij components will tend to make the group-level coefficient
smaller (!) than the within-groups coefficient. Any effect
associated with the occasion (noise outside the test room, faulty
instructions, etc.) is common to all members of the group. Since those
components of error are not independent over persons, averaging within
the group does not necessarily reduce them.

If η²(μ) is large and the η² for the other three components in the
within-groups denominator are small, the ratio of between-groups numerator
to within-groups numerator will perhaps be larger than the ratio of denominators.
Then the between-groups coefficient is the larger one.

Case III analyses can of course be made for many other experimental
designs. The formulas remain to be worked out according to the principles
exhibited in CGNR.

An overall "individual-level" coefficient can
be calculated by adding the two numerators, adding the two denominators, and
then dividing. This is not the value that would be estimated for the overall
coefficient by an analysis that ignored groups.

General remarks

I have replaced the "individual" (overall) and "group"
reliabilities of other writers with "group" and "individual-within-group"
reliabilities. Also, I have separated Cases I, II, and III, whereas other
writers confined themselves to Case I without realizing it. The six
coefficients will vary in size, but whether the differences are large
only future experience can tell us. Surely no one will question the
advisability of choosing the logically correct coefficient in any
disattenuation procedure.

It should be noted, however, that I have discussed formulas for
the coefficient of generalizability only because disattenuation requires
a coefficient. In the commonplace investigation of measurement error
the standard error is of far more interest than the coefficient. For
most purposes, it is more important to know how well a group is measured
than to know whether the measure discriminates between groups. The
standard error of generalization of the group mean [σ(X_c - μ_c)] is
likely to be considerably smaller than the within-group standard error,
as persons are fixed within groups.

Notes for Section 6

p. 6.3  The error is defined according to the experimental design. The data
providing a coefficient of generalizability may not be the same as those
used to calculate the Y-on-X regression. Indeed, the G study may be carried
out under one case and the D study (the regression study) under another.
That may still permit one to determine an appropriate coefficient
of generalizability, provided that the groups used in the G study are a sample
from the population of groups in the D study. In this report I shall assume
that the G-study and D-study data are collected by the same experimental
design, on samples formed under the same case.

p. 6.5  The Bowers data (p. 2. ) appear to me to be an example of Case III. Hauser
has pointed out the importance of correcting regression coefficients for
attenuation in reaching a decision about the apparent context effect in the
Bowers data, but Hauser assumed that the within-colleges coefficient would
be small compared to that between colleges. In Bowers' study, a mail questionnaire
on attitude and behavior went to students at many colleges. The
only facet along which it seems reasonable to classify individual data is
occasions. No doubt variability would appear if the questionnaire had been
filled out on two independent occasions. I suspect that there are systematic
College x Occasion effects, even if all mailings took place in the same
month. A cheating scandal erupting on one campus would cause student
responses to a question on cheating to shift, the shift being fairly uniform
within that college and not appearing in other colleges. If the question
about having been drunk is asked before the main event of the social year
on one campus and, by the vagaries of the local calendar, is asked subsequent
to that event on another, we can again expect appreciable variability over
occasions that is to be considered a group-related error.

Read: "p nested within c". The code for designs follows CGNR.

7. Statistical inference

The investigator will wish to generalize formally or informally
beyond his sample. In the problems considered here, it seems to me that
statistical inference should center on setting confidence intervals on
parameters within one treatment. I prefer confidence
intervals (or posterior distributions) to tests of the null hypothesis
for many reasons, the most compelling one being that in research such as we are
discussing the null hypothesis has a high probability of survival.
Confidence intervals enable one to report what he found with due caution, yet
without suggesting that his study added nothing to knowledge. Posterior
distributions have the added advantage that, in principle, they enable experience
to accumulate whereas other procedures treat each study as a new
venture.

I propose to discuss only limited aspects of statistical inference.
Within one treatment, we have to think about the between-groups and within-groups
regressions (common and specific) of Y on X. I assume that groups
are randomly sampled from a population of groups, and that the distribution
of X_c, Y_c is bivariate normal. I assume members fixed within collectives.
I make no attempt to set limits on the regression of Y onto μ. In
time, procedures for setting limits on disattenuated regressions will be wanted.

A full examination would next move on to estimates of the treatment
effect. A distinction is required between one-population and two-population
(or larger) studies, the former being those where groups were assigned randomly
to treatment. The case of homogeneous regressions (traditionally assumed)
must be considered separately from the case where within-treatment coefficients
differ. Again, error of measurement is to be considered; use of the
attenuated regressions in covariance adjustment gives a false conclusion.
The difficulties of inference
about single treatments or treatment comparisons have not been resolved even
for the study where individuals are assigned and treated, with none of the
complications introduced by grouping (Cronbach et al., 1976).

I omit inference about multiple regressions from this report entirely.

Sampling error of a mean

The simplest statistical inference evaluates the population mean on the
basis of sample information. If individuals are the unit of sampling and
analysis, the sampling error of the mean is estimated by s(Y)/√N. If groups
are the unit of sampling and analysis, the corresponding formula is s(Y_c)/√k,
where k is the number of groups.

Suppose all groups are of size n; N = kn. Then s²(Y_c) = η² s²(Y).
In random sampling the intraclass correlation is 1/n, and the two modes of
calculation will lead to very similar conclusions. The conclusions will not
be identical, as the t distribution depends on the number of degrees of
freedom claimed.

With larger values of η², the sampling error calculated at the group
level (as it should be) becomes quite a bit larger than the one calculated
at the individual level. Consequently, analysis with groups as units
generates comparatively wide confidence intervals. When the null
hypothesis is valid, the analysis with individuals as units will report
a significant effect unduly often.
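
The two modes of calculation can be set side by side. The sketch below is illustrative only: the function names and the figures (25 classes of 16, an s.d. of 10) are hypothetical, and the group-level formula simply substitutes s²(Y_c) = η² s²(Y) into s(Y_c)/√k.

```python
import math


def se_individual_level(sd_y, k, n):
    """Standard error of the mean treating all k*n individuals as the units."""
    return sd_y / math.sqrt(k * n)


def se_group_level(sd_y, eta2, k):
    """Standard error of the mean treating the k group means as the units.

    Uses s^2(Y_c) = eta2 * s^2(Y), so s(Y_c) = sqrt(eta2) * s(Y).
    """
    return math.sqrt(eta2) * sd_y / math.sqrt(k)


k, n, sd_y = 25, 16, 10.0                          # hypothetical design
print(se_individual_level(sd_y, k, n))             # individuals as units
print(se_group_level(sd_y, eta2=1 / n, k=k))       # random grouping: same value
print(se_group_level(sd_y, eta2=0.36, k=k))        # homogeneous classes: larger
```

The ratio of the group-level to the individual-level standard error is the square root of n times η², so any intraclass correlation above 1/n widens the interval that the data honestly support.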

The between-groups regression

Where groups are randomly sampled from a population and
all receive "the same treatment", the well-known procedures for
establishing confidence intervals for a regression line apply to the between-groups
regression. The parameters of the regression equation in the population
are α_b and β_b, and the value of σ(Y_c|X_c) is also pertinent.
Each sample is characterized by a pair of means, a coefficient b_b, and
an s_Y·X. Under the assumption of normality, the b_b are distributed normally
about β_b independently of the s_Y·X. The expected joint distribution of b_b, s_Y·X
pairs permits one to define an elliptical confidence region in the b, s space
outside which the pair β_b, σ is unlikely to fall. From this comes the usual
equation which, for a between-groups regression, can be written (Dixon and
Massey, 1969, p. 198):

     Y = Ȳ + b_b (X - X̄) + t_α/2 s_Y·X √[1/k + (X - X̄)² / ((k - 1) s²_X)],

where X̄, Ȳ, and s²_X are the means and the X variance of the group means.

This describes the lower confidence limit for the regression line. The upper
limit is described by the same expression with t_1-α/2 replacing t_α/2. There
are k - 2 d.f. for t, where k is the number of groups. The two equations
describe an hyperbola in the X, Y space; the asymptotes of the hyperbola
(which pass through the sample mean X̄, Ȳ) identify the outer limits of
the regression coefficient.

Confidence intervals for group data are likely to be wide, because in
most studies the number of groups is small. If s_Y·X is small, however, as
happens in some group data, the confidence interval can be satisfactorily
narrow in the neighborhood of the X mean.

Samples on the order of 100 classes are required to make b_b a good
estimate of β_b. This statement comes as an unpleasant surprise to most
research workers, and some find it hard to believe. A simple example may
overcome such doubts.

Suppose that σ(X_c) = σ(Y_c) = 1, and that ρ(X_c, Y_c) = 0.40.
Then z = 0.42; σ_z = 1/√(k - 3). If k = 100, σ_z = 0.10 and
the 95% limits on sample values of z are 0.22 and 0.62, implying
limits of 0.22 and 0.55 on r. Swings over that range, and over the
corresponding range of b, could be consequential. Just how large a
sample to demand is a matter of judgment, of course.
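
The arithmetic of this example follows Fisher's z transformation and is easy to reproduce. In the sketch below the function name `r_limits` is hypothetical, and 1.96 is the usual normal multiplier for 95% limits; the second call, with 18 classes, is an added illustration using the number of D classes mentioned earlier in this report.

```python
import math


def r_limits(rho, k, z_crit=1.96):
    """Approximate limits on sample correlations for k groups, via Fisher's z."""
    z = math.atanh(rho)                  # z = 0.5 * ln[(1 + rho) / (1 - rho)]
    sigma_z = 1.0 / math.sqrt(k - 3)     # standard error of z
    lower = math.tanh(z - z_crit * sigma_z)
    upper = math.tanh(z + z_crit * sigma_z)
    return lower, upper


lower, upper = r_limits(0.40, k=100)
print(round(lower, 2), round(upper, 2))  # 0.22 0.55, as in the text

lower_18, upper_18 = r_limits(0.40, k=18)
print(lower_18, upper_18)                # with 18 classes the limits include zero
```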


The within-groups regression

The usual procedure cannot be adapted to establish a confidence interval
for the pooled-within-groups regression if individuals are fixed within groups.
The several b_c are independent estimators of Eβ_c. If the β_c are assumed to
have a normal distribution, it is a simple matter to set confidence limits.
The number of degrees of freedom is k - 1. These limits can be thought
of as describing two within-groups regression lines both of which pass through
the point 0,0. (The hyperbola of the between-groups inference degenerates when
the mean is given a priori.) Whether these confidence limits will be wide or
narrow depends on the spread of the β_c.

With regard to the specific β_c, it is difficult to ask a useful inferential
question in the usual circumstances (see pp. 4.1-8).
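
One reading of this procedure is an ordinary one-sample t interval on the k class slopes, with k - 1 degrees of freedom. The sketch below rests on that reading, which is an assumption rather than a formula recovered from the text; the function name `mean_slope_limits` and the five slopes are hypothetical, and the critical value of t is supplied by the user from a table (2.776 is the two-sided 95% value for 4 d.f.).

```python
import math
import statistics


def mean_slope_limits(class_slopes, t_crit):
    """Confidence limits on the mean within-class slope.

    class_slopes : one regression slope per class, treated as independent
    t_crit       : critical t for len(class_slopes) - 1 degrees of freedom
    """
    k = len(class_slopes)
    mean_slope = statistics.mean(class_slopes)
    standard_error = statistics.stdev(class_slopes) / math.sqrt(k)
    return mean_slope - t_crit * standard_error, mean_slope + t_crit * standard_error


slopes = [0.5, 0.7, 0.6, 0.8, 0.4]       # hypothetical slopes from five classes
lower, upper = mean_slope_limits(slopes, t_crit=2.776)
print(lower, upper)
```

The width of the interval depends only on the spread of the class slopes and on k, not on the number of children within each class.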
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8. Analysis of covariance 
Analysis of covariance is used to evaluate the difference among 
outcomes in two or more treatments. An adjustment for initial differences 
is the crux of the procedure. Even when assignment of individuals or 
classes to treatments is random, the choice of unit of analysis has some 
effect on the result. When assignment is not controlled, the initial
difference may be large; then the choice of units may greatly affect the 
adjustment. The standard procedure is to calculate (directly or indirectly) 
an adjusted outcome score for each person or group. Should the adjustment 
be derived from the within-groups, the between-groups, or the overall 
individual level regression coefficient? Many investigators seem routinely 
to assume that the regression coefficient calculated at the individual
level (within treatments) should be used. Among those who recognize more 
than one possibility, some carry out and report alternative analyses 
without a clear basis for interpreting them. 

In analysis of covariance, a number of difficulties arise even apart 
from questions regarding units of aggregation. In the best case, one has
an experiment with random assignment; then the analysis with any regression
equation gives an unbiased estimate of the treatment effect. The statistical
model assumes that the covariate and its values are fixed, and this
is not generally appropriate in social and educational research.
Problems multiply when selection or self-selection determines who enters
and completes each treatment. Poor data on initial characteristics
(failure to measure some characteristic for which adjustment should be
made, or inaccurate measurement) bias the estimate of the treatment
effect. I shall say no more about these difficulties, though they are
pertinent to research on groups.

Recommendations for analysis of covariance have to take into account 
the design for collecting data. Any of the elaborate designs to which 
analysis of variance is applied can be extended by adding covariates, since 
ancova is anova of adjusted scores. It will be sufficient here to consider 
two designs, and to limit attention to investigations with only two treat- 
ments, A and B. 

Design 1 is an extension of the two-group experiment (or quasi- 
experiment). Collectives are nested within treatments, and members are 
nested within collectives. Collectives are considered to be a random 
sample of a population of collectives; if assignment to treatment was 
nonrandom, there is a population for each treatment, defined by the selec- 
tion rules, explicit or implicit. I have suggested that members be con- 
sidered fixed within collectives, but some analyses treat members as random. 

Design 2 crosses treatments with a blocking factor. This factor may 
be a higher-order collective. In the Performance Contracting experiment, 
schools — the unit to which treatments were assigned — were nested 
within school districts, each district in the study having a school in 
each treatment. The factor may be a potential cause whose main effect is 
to be removed from the error variance, as when every teacher handles a class 
in each treatment. The factor may be a characteristic of
persons, as when class membership is determined by selecting students
within certain IQ ranges. Once collectives are identified within blocks,
they may be assigned to treatments randomly or not. Collectives within
a treatment are assumed to be experimentally independent.

Design 1. Collectives nested within treatments

It is usual in educational research to choose one set of schools or
classes for Treatment A and to choose independently another set for B.
This was the design in the Follow Through study where Abt (Cline et al.,
1974) offered analyses of covariance at the pupil, class, and school levels. 
"Where results are consistent for parallel questions across the three 
levels of aggregation", they said, "we have enhanced confidence that they 
represent the true effects." This was a study with nonrandom assignment 
and the treatment populations differed in initial characteristics. 
The analyses would be most unlikely to agree with each other
even if Abt had used the same variables in each analysis. Each covariate
was formed by multiple regression, hence different composites were used to 
make the three adjustments. It would require an enormous coincidence for 
adjustments made with different composites and different regression 
coefficients to be the same. 

Alternative adjustments. It will be instructive to consider a
detailed list of alternatives, though some of them seem unreasonable
a priori. To keep matters simple, suppose that there is one perfectly
reliable covariate X, that there are just two levels, and that
corresponding regressions have the same coefficient from treatment to
treatment. But do not assume that β_b = β_w. Set the mean value of
X for all cases at zero.

Each analysis determines the intercept of a within-treatment regression
line at X = 0. The heart of the process is to fix a coefficient β.
Then one subtracts βX from the Y score for each unit of analysis and
averages within the treatment. The coefficient may be determined in these
ways:

1. Overall. Calculate a within-treatments regression coefficient
without regard to boundaries of collectives. 

2. Between collectives. Calculate a regression for collective 
means, within treatments. 

3. Within collectives. 

3a. Convert scores to deviations from the mean of the collective.
Pool collectives, and calculate the regression coefficient. 

One can obtain an intercept for each collective, or for 
all collectives. 
3b. Calculate a regression equation within each collective, 
and use it to adjust scores of members. This gives an 
intercept for the collective. 

3c. After calculating within-collective coefficients as in 3b, 
determine the trend of coefficients as a function of the 
mean of the collectives on X . For collectives with 
any X mean, obtain a coefficient on the basis of this 
regular trend. 

In analysis 1, the number of degrees of freedom comes from the number of 
individuals. In 2, 3b, and 3c, the number of classes is the basis for
determining degrees of freedom. In 3a, investigators might adopt either 
basis for calculating degrees of freedom. 
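
The first three ways of fixing the coefficient (1, 2, and 3a) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any study discussed here: the data for the three collectives are invented, the function names are hypothetical, and collective means are weighted by the number of members.

```python
from statistics import mean

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def three_coefficients(collectives):
    """collectives: list of (xs, ys) pairs, one pair per collective."""
    all_x, all_y = [], []        # raw scores, boundaries ignored
    mean_x, mean_y = [], []      # collective mean entered once per member
    dev_x, dev_y = [], []        # deviations from the collective's own mean
    for xs, ys in collectives:
        mx, my = mean(xs), mean(ys)
        for x, y in zip(xs, ys):
            all_x.append(x); all_y.append(y)
            mean_x.append(mx); mean_y.append(my)
            dev_x.append(x - mx); dev_y.append(y - my)
    b_overall = slope(all_x, all_y)      # adjustment 1
    b_between = slope(mean_x, mean_y)    # adjustment 2
    b_within = slope(dev_x, dev_y)       # adjustment 3a (pooled within)
    return b_overall, b_between, b_within

# Invented data: three collectives within one treatment.
data = [([-2, -1, 0], [1, 1, 2]),
        ([0, 1, 2], [3, 4, 4]),
        ([2, 3, 4], [6, 7, 9])]

b_o, b_b, b_w = three_coefficients(data)
x_bar = mean(x for xs, _ in data for x in xs)
y_bar = mean(y for _, ys in data for y in ys)
for label, b in (("overall", b_o), ("between", b_b), ("within", b_w)):
    # Adjusted treatment mean = intercept of the fitted line at X = 0.
    print(f"{label:8s} b = {b:.3f}  adjusted mean = {y_bar - b * x_bar:.3f}")
```

In these invented data the between-collectives slope is the steepest, so it gives the smallest adjusted mean for a treatment whose sample stands above zero on X.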

The several analyses are illustrated in Figure 8.1, which shows
schematically the data for three collectives in just one treatment.
In this illustration the between-groups slope is steeper than any of the
within-group slopes, and within-group slopes have a regular trend.
Panels (i), (ii), and (iii) represent adjustments 1, 2, and 3, respectively.

Figure 8.1. Estimates of adjusted treatment mean considering
three alternative regression lines

The adjusted mean in Panel (ii) is the smallest of the three.
Although it is not necessary that the between-groups coefficient be
larger than that within groups, experience shows that this usually is the
case. Adjustment by means of the between-groups coefficient, then, is a
more drastic adjustment than the others, and leads to a less favorable
conclusion about the treatment for which the sample stands higher on X.

The pooled-within-classes adjustment (3a) is shown by the dotted
line in Panel (iii); it gives the largest of the adjusted means. The
overall adjustment in Panel (i) is close to that in Panel (ii) because of
the large η²_X. It will always lie between the adjusted means from the
between- and pooled-within adjustments. It will be close to the latter
if η²_X is small. The adjustment 3c that takes into account the steeper
slope in classes with higher X gives the intercept indicated by the
arrow in Panel (iii). This is the average of the separately adjusted
means for the classes.
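
Why the overall adjustment must lie between the other two can be shown with the standard decomposition of the overall slope into a weighted composite of the between and pooled-within slopes. The sketch below is illustrative only; it assumes the weight is the proportion of the variance in X that lies between collectives (written eta_sq_x here), and the numerical values are hypothetical.

```python
def composite_slope(b_between, b_within, eta_sq_x):
    """Overall slope as a weighted composite of between and within slopes.

    eta_sq_x -- share of the variance in X lying between collectives
                (0 = all within collectives, 1 = all between).
    """
    return eta_sq_x * b_between + (1.0 - eta_sq_x) * b_within

# Hypothetical slopes: between-collectives steeper than pooled-within.
B_BETWEEN, B_WITHIN = 1.5, 5 / 6

for eta_sq_x in (0.1, 0.5, 0.8):
    b_overall = composite_slope(B_BETWEEN, B_WITHIN, eta_sq_x)
    print(f"eta squared = {eta_sq_x:.1f}: overall slope = {b_overall:.3f}")
```

When most of the variance in X lies between collectives the overall slope approaches the between-collectives slope; when little does, it approaches the pooled-within slope.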

If it is believed that group-caused effects are nonnegligible, the
between-collectives analysis appears to be appropriate. Each collective
is an independent realization of the effects, group-caused and other,
sampled from a population of realizations. The intent is to generalize
to a population of realizations for which the overall mean X is near zero.

If there are no group-caused effects, then one could still analyze
appropriately at the between-collectives level. Collectives are the unit
of sampling, and unless collectives differ only by chance on initial
variables relevant to Y, the statistical inference has to be at the
group level.

I see no way to defend adjustment based on any of the coefficients
calculated within collectives. Note that the uppermost regression line
in Panel (iii) projects the unadjusted mean for Y in the rightmost 
collective into a considerably lower adjusted mean. If this has any 
meaning, it is a prediction as to what would happen if individuals with 
X = 0 were treated as members of a collective with a high mean on X . 
The extrapolation is rash on its face. To use it in evaluating the 
treatment, however, embodies the even more absurd extrapolation that this 
is the mean to be expected "if this class were made up of students whose 
mean score is zero". The presumed reason for a steep slope in the high-X 
classes is that the level of X makes a difference in the slope, so that 
the extrapolation is self-contradictory. 

If within-groups slopes are irrelevant, why mention them? My reason
is that they crop up in practice! Most obviously, every time an educational
investigator performs an experiment with one collective per treatment,
his analysis of covariance uses adjustment 3.
(Since there is only one class per treatment, cases 3a, 3b, and 3c are
indistinguishable, and indistinguishable from the overall adjustment 1.)
The analysis is justified if the class is a random collection of individuals
who respond independently to the treatment. If not, the investigator has
adjusted without information on β_b. If β_b ≠ β_w, he has overadjusted
or underadjusted.

In a multigroup study, if one knows or is prepared deliberately to assume
β_b = β_w, the overall analysis is justified and the others are less
suitable. The overall analysis also makes sense when individuals are sampled
individually and the individual's experience is not systematically associated
with that of others in his group. In this latter case, demographic effects may
cause a difference between β_b and β_w, but this has no bearing on the
estimate of the treatment effect.

Cooperative Reading data. To illustrate the contrasts among
analyses 1, 2, and 3a, Webb and I processed data from the study of Hahn
within the Cooperative Reading Program. We used 183 students in 11 classes
that had followed a Language Experience (LE) program and 189 in 11 classes
in a Basal (B) program. The raw score on Stanford Word Reading at the end
of Grade 1 (a component of POST) was our dependent variable, and the
Pintner test of mental ability our covariate. The B group had a Pintner
mean higher than the LE group (1.24 points higher; about 0.2 s.d.).

1. The first analysis used the overall regression. Scores within a
treatment were pooled without regard to class membership. The overall
regression slope for the combined treatments was 0.45. The covariance
analysis gave this information:

Mean for LE      before adjustment, 24.5 ;   after adjustment, 24.2
Mean for Basal                      22.2 ;                     22.5
Difference                           2.3                        1.7

                          SS        d.f.       MS        F
Treatments              290.94        1      290.94     9.43
Within treatments     11384.02      369       30.85
Adjustment             4289.71        1
Total                 15964.67

To recognize possible inhomogeneity of regressions we also calculated
the coefficient within each treatment separately. The regression coefficients
were [illegible] in LE and 0.37 in B, and the corresponding adjusted means were
[illegible] and [illegible]. The use of the specific coefficients had little
effect on the difference in adjusted means, as is to be expected; fitting
within a treatment serves primarily to reduce the residual variance.

2. When class means were used to calculate the within-treatments
regression coefficient, it rose to 0.63. The covariance analysis was 
carried out as before but with the adjusted class mean as the dependent 
variable. This score was entered for each class member, in keeping with 
our policy of weighting. The sums of squares from this analysis were used, 
but the number of degrees of freedom for the denominator given by the 
computer, based on individuals, was replaced with 20, based on classes. 





                          SS        d.f.       MS        F
Treatments              222.25        1      222.25     2.34
Within treatments      1894.01       20       94.70





The F ratio does not reach significance. An adjusted treatment effect of 
1.5 replaces the 1.8 of the individual-level analysis. The adjusted 
treatment means are 24.1 and 22.6. (Bond and Dykstra reported an adjusted 
individual-level treatment effect of 1.8, matching ours. After class- 
level adjustment, they reported an effect of 5.6, however; I have been 
unable to determine why.) 

3. The within-classes coefficient for treatments pooled was 0.42.

The summary in Table 8.1 indicates that the shift in methods of
adjustment did not produce a great difference in the adjusted treatment
effect. The shift from a claim of significance to nonsignificance
stems from the larger error variance that accompanies the correct
number of degrees of freedom.
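
Both points can be retraced from the figures already reported. The sketch below is an illustration, not the original computation. It assumes that each adjusted difference is the raw difference minus the coefficient times the covariate gap, with the 1.24-point gap signed in the direction the tabled means imply (LE adjusted downward, B upward); the 5 per cent critical value for 1 and 20 d.f. is the usual tabled figure, and the names are hypothetical.

```python
RAW_DIFFERENCE = 2.3    # unadjusted LE - B difference on the outcome
COVARIATE_GAP = 1.24    # assumed sign: the direction implied by the tabled means

def adjusted_difference(b):
    """Treatment difference after removing b times the covariate gap."""
    return RAW_DIFFERENCE - b * COVARIATE_GAP

def f_ratio(ss_treatment, ss_error, df_error, df_treatment=1):
    """Mean square for treatments over mean square for error."""
    return (ss_treatment / df_treatment) / (ss_error / df_error)

coefficients = {"overall": 0.45, "between classes": 0.63, "within classes": 0.42}
adjusted = {name: adjusted_difference(b) for name, b in coefficients.items()}
for name, diff in adjusted.items():
    print(f"{name:16s} adjusted difference = {diff:.1f}")

f_individual = f_ratio(290.94, 11384.02, 369)   # d.f. based on individuals
f_class = f_ratio(222.25, 1894.01, 20)          # d.f. based on classes
CRITICAL_F_1_20 = 4.35                          # tabled 5% point, 1 and 20 d.f.
print(f"F (individuals) = {f_individual:.2f}")
print(f"F (classes)     = {f_class:.2f}, critical value {CRITICAL_F_1_20}")
```

The adjusted differences move only a few tenths of a point across methods, while the error mean square roughly triples when classes rather than individuals supply the degrees of freedom.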

Follow Through data. A more brief example can be derived from
the Follow Through data mentioned in Section 5. Featherstone reported
that children with prior preschool experience were better off in a
less-directive treatment. The study had classes nested within treatments.
Table 8.1. Coefficients and adjusted means from three analyses

                                      Means after adjustment
Analysis            Coefficient(s)        LE          B        Difference

Anova                                    24.5        22.2         2.3
Overall                  .45             24.2        22.5         1.7
Between classes          .63             24.1        22.6         1.5
Within classes           .42             24.2        22.5         1.8



Table 8.2. Alternative adjustments of hypothetical data for three collectives
within a treatment

            Mean SES     Mean MRT     Adjustment    Adjusted B mean   A - B difference
Center      A     B      A     B       1      2        1      2          1      2

  1         7     8      5     2      -3     -1       -1      1          6      4
  2         5     6      0     0      -1     -1       -1     -1          1      1
  3         3     4     -5    -2       1     -1       -1     -3         -4     -2

 All        5     6      0     0      -1     -1       -1     -1          1      1

Adjustment 1 is to overall mean of A's and adjustment 2 is to center mean of A's.

Featherstone found it appropriate to use separate regression lines for
the two treatments. As in Section 5, all variables are rescaled so that 
the s.d. for all cases together is 100. 

Here I take the posttest on the Preschool Inventory as dependent 
variable and preschool experience as predictor. When the same set of data 
is processed in modes 1, 2, and 3a, the adjusted treatment effects shift 
as follows: 

                                            ND mean    D mean    Difference

Unadjusted                                  -0.208     0.269      0.477
1.  Adjustment with overall regression      -0.075     0.147      0.222
2.  Between-groups adjustment               -0.042     0.147      0.189
3a. Within-groups adjustment                -0.095     0.148      0.243

The between-groups adjustment, which I consider the appropriate one,
reduced the treatment effect to about 85 per cent of that reported by
the conventional overall analysis, and reduced the numerator of the
F ratio by 28 per cent. Change occurred primarily in the value for ND,
since b_b and b_o were nearly the same in D.
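
The two percentages follow directly from the adjusted differences. The lines below are only a check on that arithmetic; they assume that, with group sizes fixed, the treatment sum of squares (the numerator of F) varies with the square of the adjusted difference.

```python
overall_effect = 0.222    # adjusted difference, overall regression
between_effect = 0.189    # adjusted difference, between-groups regression

# Share of the conventionally estimated effect that survives.
effect_ratio = between_effect / overall_effect

# Assumed: the F numerator is proportional to the squared difference.
numerator_reduction = 1.0 - effect_ratio ** 2

print(f"effect retained:       {effect_ratio:.0%}")
print(f"F numerator reduction: {numerator_reduction:.0%}")
```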

Design 2. Treatments crossed with blocks; collectives nested

The design in which collectives are blocked gives up some number of
degrees of freedom, but brings irrelevant variance under tighter control. Once
data are collected under such a design it seems to make no sense to
ignore the blocking. The blocks may be regarded as fixed (e.g., when the
50 States serve as blocks) but it is probably more common to regard them
as randomly representative of some larger population of blocks. Then the
adequacy of the information depends on the number of blocks in the sample.

Suppose that, within blocks, there are just two collectives, one
assigned to each treatment. Then, if no covariate is to be considered,
one might reasonably form the means for the collectives, take the difference
between treatments in each block, and test whether the mean of the
differences differs from zero. An equivalent
procedure is a two-way analysis of variance, with the Blocks x Treatment
interaction supplying the error variance for the F ratio. Blocking serves
the same function as analysis of covariance, insofar as there are relevant
initial differences between blocks. Whatever variables contribute to
variance in outcome at the block level are extracted; this does not modify
the estimate of the treatment effect, but it reduces the estimated
sampling error.
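
The equivalence of the two procedures can be verified numerically. The following sketch is an illustration with invented collective means for five blocks; the function names are hypothetical. It computes the matched t on within-block differences and the F ratio whose error term is the Blocks x Treatment interaction, and the square of the first equals the second.

```python
import math
from statistics import mean, stdev

def matched_t(a, b):
    """t for the mean of within-block differences (d.f. = blocks - 1)."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

def blocks_by_treatments_f(a, b):
    """F for treatments, with Blocks x Treatment interaction as error."""
    n = len(a)
    grand = mean(a + b)
    mean_a, mean_b = mean(a), mean(b)
    ss_treatment = n * ((mean_a - grand) ** 2 + (mean_b - grand) ** 2)
    ss_interaction = 0.0
    for x, y in zip(a, b):
        block_mean = (x + y) / 2
        ss_interaction += (x - block_mean - mean_a + grand) ** 2
        ss_interaction += (y - block_mean - mean_b + grand) ** 2
    return ss_treatment / (ss_interaction / (n - 1))   # 1 and n - 1 d.f.

# Invented collective means: one A and one B collective in each block.
treatment_a = [12, 15, 9, 14, 11]
treatment_b = [10, 14, 10, 11, 9]

t = matched_t(treatment_a, treatment_b)
f = blocks_by_treatments_f(treatment_a, treatment_b)
print(f"t = {t:.3f}, t squared = {t * t:.3f}, F = {f:.3f}")
```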

A covariate may now be introduced to allow for initial differences 
between collectives within the block. The question is, how does the mean
outcome in the collective relate to the initial mean? And how would the
difference in outcome means be altered if the initial difference were zero? 

The plan of the Head Start study. My thinking about Design 2 was
stimulated primarily by the famous Head Start evaluation made by the
Westinghouse Learning Corporation (hereafter WLC; 1969) and the reanalysis
by Smith and Bissell (SB; 1970). Both sets of analysts wrestled with the
problem of units. Although I shall raise questions regarding the
solutions put forth in those reports, WLC and SB were ahead of their time
in their thinking about levels of analysis. The entire body of data
includes findings for full-year Head Start programs and summer programs,
for white children and black children, and for follow-up results on many
tests given in Grades 1, 2, and 3. I shall stay within the data
processed by Smith and Bissell as well as WLC and shall restrict attention
further to the full-year data on children of all races together, with
SES as covariate and Total score on the Metropolitan Readiness Test early 
in Grade 1 (MRT) as outcome. I shall not trace the influence of a subtle 
shift in covariates that occurred; in one analysis WLC used a single 
predetermined SES composite and in another formed a three-variable 
composite post hoc by multiple regression (overall); SB formed three 
composites, one from the between-group correlations, one within-groups,
and one overall (p. 9.18). For my purposes, I shall simply speak of SES 
as covariate. 

Head Start was administered through local offices or centers, each 
with its own territory; centers were the primary unit of sampling for the
study. Within a center there was occasionally more than a single class, 
but this was too infrequent to be considered in the design. Classes 
within a center, then, were combined. The pool of Head Start (A) children 
consisted of those who could be tested in first-grade and who had attended 
the centers under consideration. A pool of control (B) children consisted
of children located in first-grade who came from the center territory,
who had been eligible for Head Start, and who had not received this or
other prekindergarten training. Now from the A pool a sample of 8 children
was drawn for each center; and 8 B children were individually paired
with them, matched on sex, race, and attendance/nonattendance at
kindergarten. Thus the design is Treatments crossed with Pairs of
Children, Pairs nested within Centers, one child per cell. Both WLC and
SB, however, ignored the matching at the individual level, and I shall
ignore it also. I do not see how the information could be used
constructively in assessing the main effect of treatments.

The character of the discriminant may be noted in passing. In this
study the collectives differed on several known variables. In addition
to the variables in Table 4.1, center pools differed in racial makeup
and in the prevalence of kindergarten attendance. This is one of the
comparatively uncommon instances of a quasiexperiment in which membership 
in a collective is correlated with a post-treatment variable, for reasons 
not arising wholly from initial demographics or from the group as cause. 
(The child's attendance at kindergarten may have been to some degree a 
consequence of his Head Start experience, or of his parents' desire to 
compensate for the absence of Head Start experience.) Whatever the causal 
chain, if children with kindergarten experience are more numerous in 
certain centers, and kindergarten raises MRT scores, the discriminant is 
correlated with Y·X on a basis that can neither be considered a context
effect nor a demographic effect. 

Alternative analyses, assuming homogeneity of regression. WLC reported
two analyses, and SB reported six. These eight by no means cover all the
possibilities, and I shall argue that none of the eight (as I understand
each of them) is logically appropriate. It is not precisely clear how
some analyses were conducted, and in both reports there are mysterious
differences in the unadjusted mean of MRT from one table to another. It
is not so important to discuss those particular analyses as it is to
comprehend the range of alternatives and to develop a rationale for choice
among them. I may as well say at the outset that my preference among the
analyses has changed more than once as I have studied the problem, and
that I am not convinced that I now know what should have been done with
the data.

The usual analysis of covariance adjusts by calculating deviations
from the mean for the treatment sample, pooling cases over treatments,
and determining the regression coefficient for outcome deviations onto
covariate deviations. This assumes that corresponding A and B
regression coefficients are the same in the population. SB found this
not to be the case at any level of analysis for MRT-on-SES regressions,
and so in some analyses they abandoned the homogeneity assumption. I
start with analyses that assume homogeneity. It will be helpful to denote
the between-collectives regression coefficient as b_b, that within
collectives as b_w, and the overall regression which ignores collective
boundaries as b_o. I am assuming here that the two b_b agree, that
the b_w agree, and that the b_o agree.

The obvious possibilities follow, grouped in three categories. I 
code them to identify the procedures I think WLC and SB adopted.
(E.g., WLC1 resembles one of the two WLC analyses.)

A. Number of individuals as base for d.f.

A-b. Use of b_b. (No one has proposed to use this. I list it
     for the sake of symmetry.)
A-w. Use of b_w. The product b_w(SES) is subtracted from
     individual MRT scores. Then to compare treatments an unmatched
     t-test is carried out on individual adjusted scores. (Where I
     refer to a t-test, the algorithm of anova or ancova could be
     used, leading to an F-test.)
A-o. Use of b_o. From MRT is subtracted b_o(SES). This is
     conventional ancova ignoring the collectives (WLC2). Smith
     and Bissell made a generalized regression analysis which is
     similar (SB1). (The significance test is often made incorrectly,
     by assessing the significance of the increment in
     mean-square-explained-by-regression when the dummy variable for treatment
     is added as a last predictor. The treatment effect is the
     regression coefficient b for the treatment variable, and the
     variance in Y accounted for by treatment is b² times the
     variance of that variable, which may be larger or smaller than
     the increment in mean square. The F ratio is raised or lowered
     in the same way that overadjustment would distort it.)

I see no warrant for taking individuals as the base for degrees of freedom
when centers were the primary unit of sampling. Both WLC and SB offer
A-o analyses with the idea that they may be "more sensitive" to small
differences, but inflating the number of d.f. simply produces spuriously
low levels of the α risk.

B. Number of collectives as the base for d.f.

B-b. Use of b_b. Either individual scores or collective means
     are residualized by subtracting b_b(SES). An unmatched t-test
     on adjusted means for collectives is run. There is an analogous
     generalized regression analysis (SB2).
B-w. Use of b_w. As in A-w, but adjusted MRT_c are used in an
     unmatched t-test.
B-o. Use of b_o. Like A-o, but with an unmatched t-test on
     adjusted MRT_c.

The salient difficulty here is with the unmatched t-test. Collectives
are nested within centers, and no method of adjusting for SES can remove
all the relevant differences among centers. Such differences do not
falsify the estimate of the treatment effect, but they unnecessarily
lower the power of the statistical inference. The obvious way to escape
this problem is to use a matched t-test. Whether this is profitable
depends on the between-centers variance in adjusted MRT.

C. Number of centers as the base for d.f.

C-b. Use of b_b as in B-b. Matched t-test run on adjusted MRT_c.
C-w. Use of b_w as in A-w and B-w, with matched t-test on
     means for collectives. WLC did essentially this with analysis
     of covariance by removing the variance for centers, and then
     using the Centers x Treatments interaction as an error term
     to evaluate the Treatment mean square, after covarying out SES
     on the basis of b_w (WLC1).
C-o. Use of b_o as in A-o and B-o, with matched t-test on
     adjusted MRT_c.

Which regression makes sense? The use of b_b seems to address the
question: If we were to search through a large number of collectives,
disregarding center membership, and were to pair up selected A and B
collectives so that each pair had the same SES mean, what difference in
the MRT_c would be expected? I say this, because using b_b estimates
the expected MRT for a collective regarding which one knows the SES mean
and nothing else. So this method does not take pairing into account.
It does not allow for variables relevant to Y, and orthogonal to X,
on which centers differ, so it loses the value of the matching t-test.

As for b_o, it suffers as usual from being a composite of b_b and b_w,
the b-regression and the w-regression. If b_b is off the mark, so is
b_o. Using b_w seems to ask almost the right question. If we search
through the pool of A and B children within a center, and select out two
sets that have the same SES mean, what mean difference in MRT would be
expected? That is, the analysis attempts to simulate the result in an
experiment where equivalent children within a center are assigned to
treatments. The analysis, however, ignores the fact that A children
were treated in a group. If many collectives were formed from the
children in the same center and given the Head Start treatment, the
between-collectives-within-center-within-A regression need not be the
same as the regression within-collectives-within-center-within-A. One
might waive the question of demographic differences (other than SES) 
between such hypothetical classes, but it seems that one must also assume 
absence of context effects to justify analysis C-o . If this argument is 
correct, then, one would need more than one A collective within a center 
to arrive at the logically appropriate adjustment. The problem does not 
arise with B's, who were treated individually and could not have been 
subject to a context effect. 

This meticulous dissection is required to work toward an understand- 
ing of analysis of covariance, but it does not cast serious doubt on the 
results from WLC1, since the amount of adjustment was small. Insofar as
that analysis is in question (aside from the challenges any quasiexperiment
is open to) the question arises from the inhomogeneity of
regressions, which I deal with next.

Alternative analyses, recognizing heterogeneity of regressions.
In the analysis with homogeneous regressions, one finds out how far the
A cases are above the reference regression line and how far the B cases
are below it (or vice versa), combines those differences, and reaches an
estimate of the treatment effect. One would have the same result if he
formed the two within-treatment regressions (which are parallel) and
determined the distance between them. The usual way of speaking about
the analysis is to speak of the "intercepts" where the two regression
lines cross a reference value of X. Although the choice makes no
difference when the regressions are parallel, it makes best sense to
think of X as at the mean of the grand population. When the regression
lines are not parallel, the treatment effect is different at each value
of X. One can compare their intercepts at the grand mean, and many
investigators would do just that, to determine the effect of the A - B
difference for the population of eligible children. SB chose instead to
determine the A - B difference where X is at the mean of the A population,
of children who were eligible and who entered. I accept this decision
at least for present purposes.
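
How the choice of reference value matters once the lines are not parallel can be seen in a small sketch. The intercepts, slopes, and reference values below are hypothetical and are not taken from the Head Start data; the function name is likewise an assumption made for illustration.

```python
def effect_at(x, intercept_a, slope_a, intercept_b, slope_b):
    """A - B difference between two regression lines at covariate value x."""
    return (intercept_a + slope_a * x) - (intercept_b + slope_b * x)

# Hypothetical within-treatment regression lines (X centered on the grand mean).
LINE_A = (2.0, 0.5)   # intercept, slope for treatment A
LINE_B = (1.0, 0.3)   # intercept, slope for treatment B

for label, x in (("grand mean", 0.0), ("mean of A population", -2.0)):
    print(f"A - B at {label} (X = {x}): {effect_at(x, *LINE_A, *LINE_B):.2f}")
```

With parallel lines the two printed values would coincide; with unequal slopes the estimated effect depends on where along X the comparison is made.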

The SB technique was to adjust scores of B children only. The B's
were of higher SES than the A's, on average, so their MRT scores were
adjusted down, to get an estimate of what the B mean would have been if
the B's had had an SES mean comparable to that of A children. If the
B regression coefficient is s, and a B child is u SES units above
the reference group of B's in SES, then his MRT score is lowered by su
units.

An important choice is to be made regarding the reference group. 
The child is, let us say, in a center where the A mean on SES is 7 
(on no matter what scale), whereas the SES mean of all A's is 5. 
The child, let us say, has an SES of 7; do we take A's in his center as
the reference group and make a zero adjustment, or do we take all the
A's as the reference group and reduce his MRT score by 2s? It makes
no difference in the final report of the treatment effect, but it does
influence the variance.

To show this, let us consider artificial data and assume an analysis
using b_b to adjust the MRT_c and not individual scores; the argument
would have the same flavor if we adjusted individuals or used another
regression coefficient.

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The simple data in Table 8*2 show b^^ = 1*0 for B collectives* In 
Center 1 the B mean on SES is 8, 3 units above the mean of all A cases; 
hence adjustment 1 is -3.0* The SES mean of 8 is just one unit above 
the A mean for that center, and adjustment 2 is -1,0* When the full set 
of computations is carried out, we see that the mean over centers for B's 
is the same with either adjustment* The treatment effect changes from 
0 to 1 with either adjustment* But the variance of differences, which 
provides the error term in a matched t-test, is lower when adjustment 2 
is used* It seems to me that adjustment 2 is more appropriate than 
adjustment 1, when matching within centers is intended* (Whether using 
b^ is appropriate is another question*) 
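
The contrast between the two adjustments can be sketched numerically. The center means below are hypothetical (they are not the entries of Table 8.2); they are chosen only so that Center 1 matches the description above, and the mean of all A's is taken as the unweighted mean of the center means, which assumes centers of equal size.

```python
from statistics import mean, variance

# Hypothetical center-level means; NOT the values of Table 8.2.
ses_a = [7.0, 5.0, 3.0, 5.0]     # SES mean of A's in each center
ses_b = [8.0, 6.0, 4.0, 6.0]     # SES mean of B's in each center
mrt_a = [12.0, 10.0, 8.0, 10.0]  # MRT mean of A's in each center
mrt_b = [12.0, 11.0, 8.0, 9.0]   # MRT mean of B's in each center


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sp = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    ss = sum((xi - mx) ** 2 for xi in x)
    return sp / ss


b_b = slope(ses_b, mrt_b)   # regression for B collectives
grand_a = mean(ses_a)       # SES mean of all A's (equal center sizes assumed)

# Adjustment 1: the reference group is all A's.
adj1 = [m - b_b * (s - grand_a) for m, s in zip(mrt_b, ses_b)]
# Adjustment 2: the reference group is the A's in the same center.
adj2 = [m - b_b * (sb - sa) for m, sb, sa in zip(mrt_b, ses_b, ses_a)]

diff_raw = [a - b for a, b in zip(mrt_a, mrt_b)]
diff1 = [a - b for a, b in zip(mrt_a, adj1)]
diff2 = [a - b for a, b in zip(mrt_a, adj2)]

print("b_b:", b_b)
print("unadjusted effect:", mean(diff_raw))
print("adjustment 1: effect", mean(diff1), "variance", variance(diff1))
print("adjustment 2: effect", mean(diff2), "variance", variance(diff2))
```

With these made-up means the estimated effect is the same under either adjustment, while the variance of the center-by-center differences, which supplies the error term of the matched t-test, is smaller under adjustment 2. Other data could order the variances differently.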

It would be possible to set up categories D, E, F of regressions 
corresponding to adjustment 1. This would be a pointless digression.
Let me say only that as nearly as I can tell SB4, which used b_o,
would be coded D-o, as counterpart of A-o, and SB6, which used b_w,
would be D-w. With regard to adjustment 2, there is no point in
detailing categories G and H, counterparts of A and B; let us consider
just the category that has the proper d.f. (counterpart of C).

I. Number of centers as the base for d.f.

I-b. Use of b_b. Regression of MRT for B collectives on SES
is determined, and MRT for B's in a particular center is
adjusted according to the discrepancy of their SES from that
of A's in the center. Matched t-test is entered with the
mean MRT of A's and the adjusted MRT of B's (SB5).

I-w. Use of b_w. Similar to I-b save that b_w is calculated
for the B's.

I-o. Use of b_o. Similar to I-b, using b_o.

Once more we face the question, which regression? It seems to me
that using b_w is wholly appropriate. B's were treated individually,

and they were identified individually within centers. Hence one wishes 
to estimate the probable MRT score for individual children within a 
center who have a particular SES mean. I can see no way in which 
between-center differences among children (which enter the other two 
regression coefficients) become relevant to a within-center adjustment. 

Although analysis I-w, which I favor, came to light because
regressions were heterogeneous, I believe it also would be justified 
if corresponding regressions had happened to be homogeneous. It will 
be recalled that the regression coefficient required to recognize the grouping of 
A cases could not be evaluated. The Head Start design, with one treat- 
ment given individually and the other treatment given to groups, treatments 
being crossed with primary sampling units, is highly exceptional. The 
general conclusion is not that some analysis is generally to be 
recommended but that any investigator proposing to use ancova on 
Design 2 must reason carefully to settle upon the proper analysis.

Notes for Section 8

8.8 Since we calculated from the adjusted scores, the computer printout
showed 370 d.f., and a higher F. Taking the number of degrees of
freedom as 1, 369, the F ratio is significant at the .01 level.

8.14 This could be called b^ , comparable to the 6^ of Section 3.

9. Multivariate considerations

In the course of this project we made several multivariate analyses,
and gained some experience in thinking about the decomposition of
multivariate relations. I shall discuss those materials only selectively
and briefly, since consensus regarding univariate analysis needs to develop before
we try to resolve multivariate issues.

Simple correlations 

Since Robinson and Yule and Kendall, it has been recognized that 
correlations change in going from the aggregate to the individual. I am 
interested in a comparison across and within groups, whereas previous 
writers contrasted correlations across groups with correlations
across individuals regardless of group.

The kind of issue that arises for the psychologist can be seen if we 
consider convergent and divergent thinking. A good many investigators 
have argued about the degree to which these are correlated, and correlations 
ranging from zero to fairly large positive values have been reported. Those 
correlations have typically been calculated by measuring schoolchildren in a 
number of classes and pooling all cases. It is reasonable to suppose that 
the classroom can have an effect on the level of divergent thinking (D) for 
children who stand at the same point on convergent thinking (C); Torrance 
and others expect certain tactics of teachers to inhibit divergent thinking. 
If the teacher's effect on D is uniform over the range of D,
and unrelated to the class mean on D, β_DC(w) will
exceed β_DC(b). The overall regression coefficient will fall between them.
The three correlations will similarly be discrepant. It would appear, then,
that an attempt to sort out within- and between-groups relations is necessary
to pursue any argument about the structure of abilities. However, the 
within and between relations differ because of demographic effects when group 
membership has no causal consequences. Computing separate correlations or 
regressions adds information but leaves interpretation equivocal. 
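
That the overall coefficient must fall between the other two follows from a standard identity: the overall slope is a weighted average of the between-groups and within-groups slopes, the weight on the between-groups slope being the intraclass correlation (η²) of the predictor. A minimal sketch with hypothetical (C, D) scores for three classes (the numbers are invented for illustration):

```python
# Hypothetical (C, D) scores for three classes; not data from the project.
classes = [
    [(1, 2), (2, 2), (3, 4)],
    [(4, 3), (5, 5), (6, 5)],
    [(7, 8), (8, 9), (9, 9)],
]


def sums(pairs, cx, cy):
    """Sum of squares of x and sum of cross-products about (cx, cy)."""
    ss = sum((x - cx) ** 2 for x, _ in pairs)
    sp = sum((x - cx) * (y - cy) for x, y in pairs)
    return ss, sp


everyone = [p for cls in classes for p in cls]
n = len(everyone)
gx = sum(x for x, _ in everyone) / n
gy = sum(y for _, y in everyone) / n

ss_t, sp_t = sums(everyone, gx, gy)

ss_w = sp_w = ss_b = sp_b = 0.0
for cls in classes:
    mx = sum(x for x, _ in cls) / len(cls)
    my = sum(y for _, y in cls) / len(cls)
    s, p = sums(cls, mx, my)
    ss_w += s
    sp_w += p
    ss_b += len(cls) * (mx - gx) ** 2
    sp_b += len(cls) * (mx - gx) * (my - gy)

b_overall = sp_t / ss_t
b_between = sp_b / ss_b
b_within = sp_w / ss_w
eta2_c = ss_b / ss_t  # intraclass correlation (correlation ratio) of C

print("between:", b_between, "within:", b_within, "overall:", b_overall)
print("weighted average:", eta2_c * b_between + (1 - eta2_c) * b_within)
```

The identity holds whichever of the two component slopes is the larger; it says nothing about why they differ.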

Correlations of reading outcomes. In exploring the Cooperative
Reading data our eyes were caught by the correlations between subtests of
the Stanford Achievement Test within the LE and B treatments. In the
conventional correlation matrix (over all individuals) the correlations
of Spelling with other subtests were conspicuously lower in the B treatment.
This could be of substantive interest. The LE program is, on its face, a
more integrated approach to language and as such would perhaps
generate higher correlations among outcomes than the Basal method.

The obvious next step was to decompose the correlations. A typical set of
values is that relating Spelling to Word Reading:

                   Within LE   Within B
Conventional         0.76        0.54
Between-classes      0.90        0.83
Within-classes       0.61        0.39

This result, and others, seemed to indicate that the treatment chiefly
affected within-classes correlations. The fact that between-classes
correlations were consistently large is also of interest. Although
correlations of aggregates often are large, it would be possible for
teachers to vary the proportionate emphasis they give to different outcomes,
and if so the between-groups correlations would fall off.

Correlations are affected by variances, and if the groups were
selected differently in the two treatments, or moved farther apart in
one treatment than another, this could account for differences in
correlations. In fact, the within-classes variances proved to be much
the same across treatments in all subtests except Spelling. In Spelling,
the within-classes variance for B was more than double that for LE. This
may be an important substantive finding, and one that is less striking
in the conventional analysis. Here is the full set of variances:

                        Reading                  Spelling
                  Within LE   Within B     Within LE   Within B
Conventional        47.65       47.67        32.62       48.87
Between classes     22.28       16.32        17.81       15.84
Within classes      25.40       31.36        14.82       33.06
Estimate of η²       0.46        0.34         0.55        0.32

The higher intraclass correlations for LE are consistent with the somewhat
higher intraclass correlations for LE on the Pintner pretest (0.37 vs. 0.23).
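
As a rough check, the intraclass correlations can be recomputed from the variance components as the between-classes share of the total. The sketch below assumes the simple ratio between / (between + within); the report does not say how its estimates were formed, so small departures from the printed values are to be expected.

```python
# Variance components copied from the table above: (between, within).
components = {
    ("Reading", "LE"): (22.28, 25.40),
    ("Reading", "B"): (16.32, 31.36),
    ("Spelling", "LE"): (17.81, 14.82),
    ("Spelling", "B"): (15.84, 33.06),
}
# Printed estimates of the intraclass correlation.
printed = {
    ("Reading", "LE"): 0.46,
    ("Reading", "B"): 0.34,
    ("Spelling", "LE"): 0.55,
    ("Spelling", "B"): 0.32,
}

# Assumed estimator: share of the total variance lying between classes.
eta2 = {key: b / (b + w) for key, (b, w) in components.items()}

for key, value in eta2.items():
    print(key, "recomputed", round(value, 3), "printed", printed[key])
```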


We regressed Reading on Spelling and Spelling on Reading, obtaining
these coefficients:

                   Reading on Spelling      Spelling on Reading
                   Within LE   Within B     Within LE   Within B
Between classes      1.06        0.84         0.81        0.82
Within classes       0.84        0.38         0.47        0.40

The one clear finding is that between-classes regression slopes are
considerably larger than within-classes slopes. Similar discrepancies were
found for other pairs of variables. This finding is not, I think, to be
dismissed as a consequence of greater measurement error in the individual
scores. Rather, it is a statement that between-class differences in
measured achievement are highly stable across outcomes in these elementary
schools. Perhaps a part of the higher relation arises from conditions of
test administration; correlated errors due to high or low group morale
would make the regression steeper. More likely, the crucial fact is that
individual patterns of difficulty (the good reader who is weak in
spelling, and his opposite) lower the within-class correlation but
balance out over the class. Between-class differences in one subject
would not be predictable from differences in another if there were
a strong tendency for one teacher to put more emphasis on Reading
(relative to Spelling) than the next teacher, or to have greater
success in teaching one subject than another.
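
The regression coefficients tabulated above can be approximately recovered from the correlations and variances given earlier, since a slope equals the correlation times the ratio of standard deviations. The sketch assumes that "Reading" in the variance and regression tables is the Word Reading subtest of the correlation table; that correspondence is my reading of the text, not something the report states. The inputs are rounded, so exact agreement should not be expected.

```python
from math import sqrt

# (correlation, variance of Reading, variance of Spelling), from the tables above.
cells = {
    ("between", "LE"): (0.90, 22.28, 17.81),
    ("between", "B"): (0.83, 16.32, 15.84),
    ("within", "LE"): (0.61, 25.40, 14.82),
    ("within", "B"): (0.39, 31.36, 33.06),
}
# Printed coefficients: (Reading on Spelling, Spelling on Reading).
printed = {
    ("between", "LE"): (1.06, 0.81),
    ("between", "B"): (0.84, 0.82),
    ("within", "LE"): (0.84, 0.47),
    ("within", "B"): (0.38, 0.40),
}

recomputed = {}
for key, (r, var_read, var_spell) in cells.items():
    read_on_spell = r * sqrt(var_read / var_spell)
    spell_on_read = r * sqrt(var_spell / var_read)
    recomputed[key] = (read_on_spell, spell_on_read)
    print(key, round(read_on_spell, 2), round(spell_on_read, 2),
          "printed:", printed[key])
```

The low within-classes slope of Reading on Spelling in treatment B is seen to come from the large within-classes Spelling variance combined with the low correlation, which is the point the text makes about the joint distribution.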

One might, as he prefers, emphasize the similarity of the Spelling-
on-Reading regressions across treatments or the dissimilarity of the
Reading-on-Spelling regressions. The proper conclusion
appears to be that (at least in the samples) the within-groups joint
distribution in B is distinctly different from that in LE, the former
having a much greater dispersion. Another way of summarizing the same
information is to emphasize the difference in η² (greater in LE for both
variables). Since (perhaps fortuitously) η² on the Pintner pretest was
greater for LE, interpretation must be left open.

The methodological moral of this exercise is that correlations among
variables may be calculated within and between groups, but should not be
interpreted by themselves. The information of importance is contained in
the joint distribution of X,Y means and of X,Y deviations expressed in a
uniform X metric and a uniform Y metric. Assuming that all distributions
are normal, three parameters describe each distribution shape (and two
parameters describe location). Correlations are derived from standardized
measures, and the standardization of a variable is different for each
distribution; contrasts are invariably distorted.

For similar reasons, use of unstandardized path coefficients 
generally is recommended. Because a direction of relationships has been 
postulated, interpretation is simpler than in the Reading-versus-Spelling 
example. Path coefficients have often been calculated from disaggregated 
data. It appears advisable to partition the structural regressions, 
making between-groups and within-groups analyses, despite the probable 
equivocality of the findings. (See also p. 9.23.) 

The comments made here apply to Härnqvist's analysis of relations
among outcomes in the International study, mentioned on p. 2.12. He not only
shows some striking differences among correlations at the individual and
aggregate levels but makes the suggestion that the disaggregated correlations
be recomputed for individuals within schools and schools within countries.
He sees the lines he has opened up as dealing with some highly significant
substantive questions. If my reasoning above is correct, the correlations
ought to be supplemented by the pertinent variances, to give a sense of
the joint distributions. Only this can give the reader a basis for interpretation.

Component analysis and factor analysis

Variables are reorganized into components or factors for three purposes: 

(1) Orthogonalization. Even if all the information from the original variables 
is retained in the orthogonal variables, it is often easier to carry out 
calculations and to make plots and summary statements in terms of orthogonal 
components. The simplest case of this kind of simplification is the change
of variables X and Y to the set X and Y·X, the latter being a partial variate.

(2) Rank reduction. There is redundancy in almost any set of variables, and 
the set can perhaps be compressed to fewer variables without much loss of 
information. Use of a limited number of components or factors simplifies, 
and relationships involving fewer variables will often crossvalidate in new 
samples better than relations fitted to a large number of variables. 

This is why minor factors or components are ordinarily discarded. 

(3) Identification of constructs. The purpose of rank reduction is to 
arrive at simple, stable empirical statements. The purpose of rotation of 
the factor set is to arrive at simple, stable descriptive or theoretical 
propositions. Those who rotate a set of such variables are searching for 
what are often called "underlying dimensions". Perhaps it is better to
think of these as constructs, as working hypotheses regarding variables 
that can be used to formulate a satisfying theoretical network. If a 

good set of variables is found, relationships can be summarized in sentences
that are comparatively simple, in the sense that each proposition employs 
only a few constructs in the set — even though all the constructs are 
important enough to enter some sentences. (The reader may recognize 
Thurstone's concept of simple structure as an illustration of this
desideratum.)

The partial variate Y·X is defined as Y − β_YX X. Since we have seen
that the regression coefficients from between, within, and overall analyses
differ, three distinct partial variates will be formed by them. The Y·X
formed in an overall analysis will not ordinarily be uncorrelated with X
either within groups or between groups. More generally, a variable set
that is mutually orthogonal in one of the three analyses will almost certainly
not be orthogonal in the other two. Insofar as an investigator is primarily
interested in orthogonalization, then, he may need separate orthogonaliza-
tions for the between- and within-groups segments of the data. As a
minimum, he must decide which set of intercorrelations he wishes to reduce
to zero.
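
A small numerical sketch may make the point concrete. Using hypothetical scores for three groups, the partial variate is formed with the overall regression coefficient; its cross-product with X vanishes overall, but the between-groups and within-groups cross-products are nonzero and merely cancel.

```python
# Hypothetical (X, Y) scores for three groups; invented for illustration.
groups = [
    [(1.0, 2.0), (2.0, 2.0), (3.0, 4.0)],
    [(4.0, 3.0), (5.0, 5.0), (6.0, 5.0)],
    [(7.0, 8.0), (8.0, 9.0), (9.0, 9.0)],
]


def cross_products(grps):
    """Total, between-groups and within-groups sums of cross-products of the pairs."""
    cases = [p for g in grps for p in g]
    n = len(cases)
    gx = sum(x for x, _ in cases) / n
    gy = sum(y for _, y in cases) / n
    total = sum((x - gx) * (y - gy) for x, y in cases)
    between = 0.0
    within = 0.0
    for g in grps:
        mx = sum(x for x, _ in g) / len(g)
        my = sum(y for _, y in g) / len(g)
        between += len(g) * (mx - gx) * (my - gy)
        within += sum((x - mx) * (y - my) for x, y in g)
    return total, between, within


# Overall regression coefficient of Y on X.
sp_total, _, _ = cross_products(groups)
ss_total, _, _ = cross_products([[(x, x) for x, _ in g] for g in groups])
b_overall = sp_total / ss_total

# Partial variate formed with the overall coefficient, paired with X.
partial = [[(x, y - b_overall * x) for x, y in g] for g in groups]
total_e, between_e, within_e = cross_products(partial)

print("overall cross-product of X with Y.X:", round(total_e, 10))
print("between-groups part:", round(between_e, 4))
print("within-groups part:", round(within_e, 4))
```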

A similar comment is to be made regarding rank reduction. When the 
first dimensions from a larger set of n variables are retained and the
remaining information discarded, this process will discard a fraction of the 
between-group information and a fraction of the within-group information. 
Those two fractions may differ in amount and in character. Suppose that 
the analysis is made within groups, and the first three factors retained. 
Those factors may account for 80 per cent of the total variance within groups
on all variables; they may account for 92 per cent of the total variance
between groups, or 70 per cent. When an overall analysis is carried out,
the first component may arise largely from between-groups variance, or
largely from within-groups variance, or from a mixture of the two in any
proportion. The same is true of each later component. It is unlikely that
several successive factors would arise from the same single source, unless 
the groups were formed at random and the between-groups information is 
nothing but noise. The practical implication is that a person who reduces 
his data on the basis of a single factor analysis at any one of the three 
levels retains the major fraction of the information at that level, while
perhaps discarding a significant fraction of the information at one of
the other levels. Where rank reduction is the aim, it is important to
examine separately the within-group and between-group residual covariances 
or correlations, to make certain that they are negligible. 

The most intriguing problems arise in the attempt to establish 
constructs on the basis of factor analysis. As was said earlier, variables 
that have the same operational definition may have different substantive 
interpretations at the individual and the aggregate levels. A factor is a 
weighted composite of observables (or variables that are in principle 
observable), hence the preceding statement applies to any factor. A composite 
that enters into simple between-groups relationships may have quite different 
relationships with corresponding individual-level variables (within groups
or overall). The person using factor analysis as a tool in theory construc- 
tion, then, will need one set of factors for his between-groups theory and 
another set of factors for his within-groups theory. To be sure, he may 
find that the two sets of constructs coincide, but that is a possibility to 
be evaluated, not assumed. 

Discriminant analysis is the one context where separate multivariate
decompositions have often been made within groups and between groups.
Discriminant analysis is an attempt to describe differences between groups
in terms of one or a few variables. In Fisher's famous example of two
species of iris, a number of physical measurements were made on many specimens
of each species. (More than two species could have been investigated.) The
analysis reduced the measures to two composites which were sufficient to
classify plants into the two populations with few errors. The first
discriminant function is whatever composite has the largest intraclass
correlation. The second is whatever composite of the remaining information
has the largest intraclass correlation. And so on. As a first step in the
analytic procedure, the within-groups variance-covariance matrix is
factored into orthogonal components and these are standardized (within groups).
A consequence of this standardization is that when any dimension is partialled
out (as in going from the first discriminant function to calculation of the
second), the multivariate distribution of residuals can again be described
by orthogonal variables with unit variance. The means of the groups on the
components are formed, and the first principal component of the between-groups
covariances becomes the first discriminant function. Successive principal
components become successive discriminant functions. The first two or more
discriminant functions can be rotated if that is thought to give a more
"meaningful" description of group differences. In the study of irises or
other similar pools, the rotated discriminants might suggest something
about the character of the genotypic differences between species.

A rather large number of factor analyses of aggregate data have been
made. R. B. Cattell (1949, 1952) suggested that a group has a "syntality"
analogous to the personality of an individual, and he paralleled his studies
of dimensions of individual differences with some factor analyses of group
differences. For other summaries or discussions of aggregate-level factor
analyses, see Janson, 1969; Cartwright, 1969; and Tryon and Bailey, 1970.
So far as I know, only Slatin (1974) has carried out factorizations of the
same data at two levels. I discuss his study below.

Whether an investigator should want factors for between-groups and 
within-groups variance is a subtle decision; in some contexts, factors 
from an overall analysis are no doubt appropriate. When groups were formed
by aggregation rules that are irrelevant to the matter at issue, the
between-groups factors that reflect those rules will be of no importance.
In Fisher's study, on the other hand, the aggregation represented a judgment
that two pools of plants were distinct biologically. Therefore the pooled
set of individuals represented an arbitrary mingling of between- and within-
groups information. The causes of variation between groups were probably
not the causes within groups, and the structure might well have differed 
from one group to another. The psychologist has most often regarded 
effects as strictly individual. Even Cattell, in his studies of individual 
traits, has analyzed overall correlation matrices, ignoring groups. This 
may be appropriate in some circumstances but probably not in all. If it is 
true that teachers affect scores on divergent thinking or spelling, to mix 
class-level differences into a study of individual differences gives us 
indirect and clouded information about individual growth in ability patterns. 
On the other hand, to use within-groups information as the basis for similar 
conclusions is a dubious practice, insofar as arbitrary or irrelevant 
assignment rules restricted the range on some variables and modified the 
intercorrelations.

We move now to an illustrative factorization of ability tests which 
will give some concreteness to what has been said. This set of data was
examined some time ago, as an exploration. It was a poor choice from a
substantive point of view. The tests were given to first-graders early in
their school careers, and the between-group differences reflect neighborhood
differences or rules for assigning children to classes rather than psychological
causes. The between-group information is essentially a summary
of demographic effects. Despite the likelihood that the overall analysis 
probably answers the questions most likely to be asked about these data, 
much can be learned from the contrasts among the analyses. 




Analysis of correlations. Miss Webb analyzed three correlation
matrices (overall, between classes, within classes) for eight pretests
in the Bond-Dykstra data, using a total of 1049 cases from the B and LE
treatments combined. (We had no reason to consider treatments separately.)
Analyses were made with unity in the diagonal and also with estimated
communalities; no insight will be lost by discussing just the latter at this point.
The first fact of interest is the high degree of multicollinearity in the
between-groups correlation matrix, so high that the communality for the
Pintner score was 0.99. (As a consequence, the computer's attempt at
varimax rotation produced a nonsensical result.) I rotated factors II
and III in the within-groups analysis through 45°, to bring them more
nearly into line with the corresponding factors of the other two analyses.

Table 9.1 presents the factor loadings, communalities, and percentages
of variance accounted for by the first three factors in each analysis.
It will be noted that the communalities were considerably higher between
groups than within groups, except for bCopying and bIdentical Forms,
which had large unique factors. Correspondingly, the common factors
accounted for a larger fraction of the between-groups variance than of
the within-groups variance.

The reader may plot the loadings for himself. He will see that the 
structure in the conventional analysis corresponds far more closely to 
the between-classes analysis than to the within-classes analysis. This is 
true even though the intraclass correlations were only about 0.30 (see 
below). Factors I and II in the conventional and between-groups analyses 
plot out as a quasi-simplex. The order in which the tests string out is 
identical except for the position of Listening. The within-classes 
analysis produces a two-cluster configuration: wCopying, wIdentical Forms,
and wPintner fall along one vector and the other five tests cluster on
another.

Table 9.1. Three factor analyses of readiness measures

Conventional           I     II    III     h²
Pintner               .7    [?]    -.2    .68
Phonemes              .7     .0     .2    .55
Letter Naming         .7     .2     .2    .54
Learning              .6     .4     .1    .49
Copying               .4     .2    -.2    .28
Identical Forms       .4     .2    -.2    .26
Word Meaning          .6    -.4     .2    .52
Listening            [?]    -.3    -.1    .47

Between classes       bI    bII   bIII     h²
bPintner              .8    -.3    -.4    .99
bPhonemes             .8    -.1     .1    .68
bLetter Naming        .8     .2     .2    .70
bLearning             .7     .5     .1    .74
bCopying              .3     .5     .1    .35
bIdentical Forms      .4     .2    -.1    .18
bWord Meaning         .7    -.6     .3    .90
bListening            .7     .0    -.1    .55

Within classes        wI    wII   wIII     h²
wPintner              .8    -.4    -.1    .75
wPhonemes             .6     .2     .2    .48
wLetter Naming        .6     .2     .3    .49
wLearning             .5     .2     .2    .32
wCopying              .5    -.2     .1    .32
wIdentical Forms      .5    -.2     .0    .25
wWord Meaning         .5     .2    -.4    .49
wListening            .5     .1    -.3    .29

[?] marks an entry that is illegible in the available copy.

The table codes the tests differently in the three factor analyses to
remind us that the between-classes and within-classes analyses look at
different variables. bPintner is the class mean (standardized after averaging),
and wPintner is the individual deviation from that mean (likewise standardized).
Because of standardization, the Pintner variable of the conventional analysis
equals η bPintner + √(1 − η²) wPintner, where η² is the intraclass
correlation for Pintner.
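
The relation among the three codings of a test can be verified directly. The sketch below uses invented scores for three classes of equal size and checks that the overall standard score equals η times the standardized class mean plus √(1 − η²) times the standardized within-class deviation, which is the form the composite takes when each part is scaled to unit variance. Variances are computed with divisor N throughout.

```python
from math import sqrt

# Hypothetical pretest scores for three classes of equal size.
classes = [[3.0, 5.0, 7.0], [6.0, 8.0, 10.0], [11.0, 12.0, 16.0]]

scores = [x for c in classes for x in c]
n = len(scores)
grand = sum(scores) / n

var_total = sum((x - grand) ** 2 for x in scores) / n
var_between = sum(len(c) * (sum(c) / len(c) - grand) ** 2 for c in classes) / n
var_within = sum((x - sum(c) / len(c)) ** 2 for c in classes for x in c) / n
eta2 = var_between / var_total  # intraclass correlation (correlation ratio)

max_gap = 0.0
for c in classes:
    m = sum(c) / len(c)
    b = (m - grand) / sqrt(var_between)    # standardized class mean
    for x in c:
        w = (x - m) / sqrt(var_within)     # standardized within-class deviation
        z = (x - grand) / sqrt(var_total)  # conventional standard score
        rebuilt = sqrt(eta2) * b + sqrt(1 - eta2) * w
        max_gap = max(max_gap, abs(z - rebuilt))

print("eta squared:", round(eta2, 3))
print("largest discrepancy:", max_gap)
```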

Factoring covariances. The standardizing operation will distort
information. If a variable has a small intraclass correlation, it has
a small between-classes variance, yet it has as much "weight" in a between-classes
factor analysis of correlations as a variable with large variance. Conversely,
a variable with little within-classes variance is given heavier weight in
the within-classes analysis when correlations are used. Our next step, then,
was to partition the overall correlation matrix of the scores into
between-groups and within-groups covariance matrices, and then to factor
those matrices. We started with correlations because
there seemed to be no reason for weighting one variable differently
from another in the overall analysis; one could, however, start with
the covariance of raw scores or of scores rescaled in some preferred manner.
Any such scaling decision affects which variables dominate the first
principal components of the overall analysis. Having begun with ones
in the diagonal of the overall matrix, we had intraclass correlations
in the diagonal of the between-groups matrix and values of 1 − η² in
the within-groups diagonal.
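
The partition just described can be sketched for two variables. The scores are hypothetical; each variable is standardized over all cases (divisor N), and the overall correlation matrix is then split into a between-groups and a within-groups covariance matrix whose diagonals are η² and 1 − η² respectively. The factoring step itself is not shown.

```python
from math import sqrt

# Hypothetical scores on two tests for three classes; invented for illustration.
groups = [
    [(1.0, 2.0), (2.0, 2.0), (3.0, 4.0)],
    [(4.0, 3.0), (5.0, 5.0), (6.0, 5.0)],
    [(7.0, 8.0), (8.0, 9.0), (9.0, 9.0)],
]
k = 2
cases = [p for g in groups for p in g]
n = len(cases)

means = [sum(p[j] for p in cases) / n for j in range(k)]
sds = [sqrt(sum((p[j] - means[j]) ** 2 for p in cases) / n) for j in range(k)]


def standardize(p):
    """Standard scores of one case, using the overall mean and SD."""
    return [(p[j] - means[j]) / sds[j] for j in range(k)]


zgroups = [[standardize(p) for p in g] for g in groups]


def zeros():
    return [[0.0] * k for _ in range(k)]


total, between, within = zeros(), zeros(), zeros()
for g in zgroups:
    gm = [sum(p[j] for p in g) / len(g) for j in range(k)]
    for i in range(k):
        for j in range(k):
            total[i][j] += sum(p[i] * p[j] for p in g) / n
            between[i][j] += len(g) * gm[i] * gm[j] / n
            within[i][j] += sum((p[i] - gm[i]) * (p[j] - gm[j]) for p in g) / n

print("overall correlation matrix:", total)
print("between-groups covariances:", between)  # diagonal: eta squared
print("within-groups covariances:", within)    # diagonal: 1 - eta squared
```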

In the between-classes analysis, three factors accounted for
79 per cent of the variance, and little variance remained to be
accounted for in subsequent factors. Therefore, Table 9.2 presents
only the first three factors. In the within-groups analysis, variance
was extracted more slowly, and five factors are retained. Neither the
b nor the w analysis was particularly close to the overall analysis of
covariances (not shown here); it could be described roughly as
"halfway between" the two.

Table 9.2. Analysis of covariance matrices for readiness measures

Between classes       bI    bII   bIII     h²     η²
bPintner             .22   -.18    .02    .22    .30
bPhonemes            .25   -.15   -.03    .25    .32
bLetter Naming       .23   -.03   -.04    .21    .29
bLearning            .22    .11   -.01    .20    .31
bCopying             .25    .58   -.19    .50    .54
bIdentical Forms     .15    .07    .53    .34    .35
bWord Meaning        .19   -.25   -.14    .23    .31
bListening           .13   -.03   -.01    .09    .16
% of variance         46     21     12     79

Within classes        wI    wII   wIII    wIV     wV     h²   1 − η²
wPintner             .63   -.07    .31   -.02   -.15    .52    .70
wPhonemes            .58    .24   -.18   -.14   -.26    .51    .68
wLetter Naming       .57    .30   -.17   -.05   -.09    .46    .71
wLearning            .47    .38   -.28    .26    .33    .61    .69
wCopying             .35    .04    .24    .06   -.26    .25    .46
wIdentical Forms     .43    .05    .52    .09    .30    .56    .65
wWord Meaning        .50   -.30   -.13   -.50    .26    .67    .69
wListening           .54   -.60   -.23    .35   -.04    .83    .84
% of variance         39     14     11      9      8     81

The η² was particularly low for Listening, and particularly high for
Copying. Possibly an explanation could be constructed from a search for 
ceiling and floor effects or other anomalies; alternatively, the patterning 
could reflect something about neighborhood characteristics. Since the 
children were tested near the start of the first grade, causal "class" 
effects are highly unlikely, except as irregular administration of tests 
affected class standings. 

The three chief components of the between-groups covariance matrix are not much like
the first three components of the correlation matrix. (Some of this shift
comes about because we decomposed the entire covariance, rather than just the
common-factor portion as before.) The general factor runs over all tests
about equally, except for Listening and Identical Forms. Components bII
and bIII are essentially specific to bCopying and bIdentical Forms.

The within-groups analysis shows fairly strong common factors. 
Listening loads more heavily than in the between-groups analysis, because 
of its small intraclass correlation. The first factor within groups is 
present about equally in all measures. Rotation could bring out the 
connection of wPintner with wCopying and the connection of wPhonemes with
wLetter Naming, but the structure is not strongly patterned.

Slatin's analyses. Slatin (1974) factored 10 variables at the group and
individual levels. His data were measures of ability and family background 
for boys, plus two indices of property value for their neighborhoods. His 
aggregates were areal units, such that the 516 boys were successively clustered 
into 47, 21, and 10 areas. Slatin factored each of four correlation matrices, 
extracting and rotating three factors. He was impressed by the differences 
between the factor structures, and suggests tentatively that a more sociological 
(more "social") explanation of phenomena will be reached when aggregate data 
are factored than when individual data are factored. 

Our work perhaps sheds some light on Slatin's findings. The most striking 
change in going from the individual analysis to the aggregate analysis was a 
shift in loadings for Age. Pairing up the varimax-rotated factors of the four 
analyses, the loadings for Age are 

                         I      II
516 individuals        -.22   -.10
47 smallest areas       .03   -.74
21 medium areas        -.19   -.92
10 largest areas       -.24   -.95

Factor I is an ability factor and the chief markers for II, other than age,
represent neighborhood wealth or father's status. Age had a much lower
intraclass correlation than several other variables (Slatin, 1969), and hence,
when variables were restandardized, tiny and fortuitous covariances across
groups were inflated to the point of making Age a powerful influence in the
aggregate-level analyses. (One covariance of about 0.01, I estimate, was
inflated into a correlation of 0.72.) If covariances had been analyzed
instead, the analyses at successive levels would have changed only
gradually. At the individual level, there is a near-simplex running through
the ability measures, then around to lot size and value of dwelling unit (DU).
(Age and Delinquency have such low correlations that they do not enter the
simplex.) Much the same simplex appeared in the aggregate covariance matrices.
If covariances had been factored, I believe that the only change would have
been a reduction of the spread of the vectors; at the highest level of
aggregation IQ and DU correlated 0.67.
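
The arithmetic of such an inflation is easily sketched. When variables are standardized over individuals, the variance of a variable across group means is roughly its intraclass correlation, so a between-groups correlation is the between-groups covariance divided by the geometric mean of the two intraclass correlations. The two intraclass correlations below are hypothetical; Slatin's values are not given here, and these were picked only to show how a covariance of 0.01 can turn into a correlation near 0.7.

```python
from math import sqrt

cov_between = 0.01  # covariance across groups, of the size estimated in the text
eta2_age = 0.002    # HYPOTHETICAL intraclass correlation for Age
eta2_other = 0.10   # HYPOTHETICAL intraclass correlation for the second variable

# Restandardizing at the group level divides by the between-groups SDs.
r_between = cov_between / sqrt(eta2_age * eta2_other)

print("between-groups correlation:", round(r_between, 2))
```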

It might have been wise for Slatin to have examined factors within aggregates. 
The relation of individual characteristics after neighborhood is held constant 
might be the best way to bring more purely psychological relations to the 
surface. But it might be equally interesting to decompose "upward", factoring 
values of X'X to see what "social" factors might appear after the individual 
information was removed. 

Suggestions

Insofar as investigators seek only to replace original variables with a
smaller number of composites that carry most of the differential information, 
no serious problems arise. There is no reason to try to establish homology 
between factors at various levels, and one will of course factor at whichever 
level he is interested in. His only major decision will be whether to 
standardize variables at that level or to use some other metric. 

It is when factors are to be regarded as constructs that interpretation 
becomes awkward. It is unreasonable to expect homology. If only for statistical 
reasons, different results are likely to appear at the between-groups and
within-groups levels; the grouping variables that modify the regression
coefficients (as shown in Section 4) also modify the covariances, even when
no causal effects are associated with the groups. Beyond this, however, 
original variables take on different meanings when aggregated. They can be 
expected to cluster differently, with the consequence that different constructs 
will be appropriate between and within groups. Let us consider total yields 
of wheat and potatoes (rather than per-acre yields). More agricultural 
counties will have larger yields of both potatoes and wheat. But within 
counties the farmer decides whether to plant potatoes or wheat, so the two 
yields may be negatively correlated at the level of the farm. In some problems 
it may make sense to track down just this patterning; in other problems 
where group boundaries seem to have little causal significance an overall 
analysis will suffice. 
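
The wheat-and-potatoes example can be sketched numerically. The yields below are invented for illustration (three counties of three farms each); nothing here comes from real agricultural data.

```python
from statistics import mean

def correlation(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical total yields per farm as (wheat, potatoes), grouped by county.
counties = {
    "A": [(10, 2), (6, 6), (2, 10)],
    "B": [(20, 12), (16, 16), (12, 20)],
    "C": [(30, 22), (26, 26), (22, 30)],
}

# Between counties: correlate county totals.
wheat_totals = [sum(w for w, _ in farms) for farms in counties.values()]
potato_totals = [sum(p for _, p in farms) for farms in counties.values()]
r_between = correlation(wheat_totals, potato_totals)

# Within counties: correlate farm deviations from the farm's own county mean.
wheat_dev, potato_dev = [], []
for farms in counties.values():
    mw = mean([w for w, _ in farms])
    mp = mean([p for _, p in farms])
    wheat_dev += [w - mw for w, _ in farms]
    potato_dev += [p - mp for _, p in farms]
r_within = correlation(wheat_dev, potato_dev)

# Overall: all farms pooled, county membership ignored.
all_wheat = [w for farms in counties.values() for w, _ in farms]
all_potato = [p for farms in counties.values() for _, p in farms]
r_overall = correlation(all_wheat, all_potato)

print(r_between, r_within, round(r_overall, 2))
```

In this deliberately extreme toy case the county-level correlation is +1, the farm-within-county correlation is -1, and the pooled correlation (about 0.72) describes neither pattern.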

Brunswik's "ecological" ideas will help in interpretation. Any result 
obtained from a sample of persons is also representative of the sample of 
subecologies in which their behavior developed. If one samples groups and 
measures everyone in those groups, the correlational information is a state- 
ment about the distribution of behavior within and between groups, hence 
within and between subecologies. The results generalize over the population 
of groups sampled. When sampling is at the individual level, perhaps from a
census roster, the result holds for persons who have grown up in a culture, 
distributed over its subecologies. The only difference from the study where 
groups were sampled is that with individual sampling there are too few persons 
from any one subecology to warrant examining its specific characteristics. 
From this point of view, then, the overall correlation describes an ecology
in the large, the between-groups correlation contrasts subecologies within
that ecology, and the within-groups correlation describes typical relations
within subecologies. The factor analyst who intends to study a purely
psychological question about the distribution and covariation of abilities
is inevitably reporting on a phenomenon that is cultural, demographic, and 
sociological in part. 

The overall covariance being a composite of between-groups and within-groups 
covariances, the adequacy of the overall data depends on the adequacy of the 
data at the two levels. Our Bond-Dykstra analysis used 1049 cases, and that 
is ordinarily enough to satisfy any factor analyst. But the information for 
the important between-groups portion of the covariance comes from a sample of 
57 classes. Few would consider 57 cases a sufficient sample for a factor analysis. 

If we assume substantial homogeneity of relations within the several 
classes — which is probably necessary if a within-classes factor analysis is 
to be taken seriously — then the within-groups covariances are well established 
when 50-odd classes are pooled. Yet insofar as Table 9.1 is representative, 
the overall analysis that has conventionally been made rests far more on the 
fallible between-groups covariances. Now if within-class relations are not 
homogeneous, the whole analysis is suspect. The pooled-within-class analysis 
has uncertain meaning, and its stability is a function of the number of 
classes, not the number of individuals. 

The case for considering separately the between and within factor 
analyses is especially impressive when we move out of the ability domain. 
An example is the Learning Environment Inventory (LEI) developed by Anderson
and Walberg (1974) within Harvard Project Physics. This is a collection 
of items describing instructional procedure and student attitudes; the 
student responds so as to report how he perceives his class. Items have been 
assembled into scales on the basis of their intercorrelations, and the scales
have been factor-analyzed. Insofar as correlations arise simply from semantic 
overlap of items, one would expect similar joint-distribution shapes within 
and between classes. But the correlational structure, insofar as it reflects 
psychological differences, may be quite different. Within the class, the 
correlation reflects the correspondence of perceptions. Are the students 
most prone to describe the class as "apathetic" also the ones prone to 
describe it as "difficult"? Across classes the correlation speaks of a 
different phenomenon: When the students collectively describe the class as 
"difficult", do they also describe it as "apathetic"? The former refers to 
the phenomenology of the student, compared to other students rating the same 
events. The latter refers to behavioral differences between classes (though 
some of those differences are perceptual rather than objective). 

The purpose of the LEI is to identify differences among classrooms. For
it, then, studies of scale homogeneity or scale intercorrelation should be
carried out with the classroom group as unit of analysis. Studying individuals
as perceivers within classrooms could be interesting, but is a problem quite
separate from the measurement of environments.

Empirical test construction. Once the question of units is raised, all
empirical test construction and item-analysis procedures need to be
reconsidered. Is it better to retain items that correlate across classes, or
items that correlate within classes? A correlation based on deviation scores
within classes indicates whether students who comprehended one point better
than most also comprehended the second point better than most, instruction
being held constant. A correlation between classes indicates whether a class
that learned one thing learned another, but this depends first and foremost
on what teachers assigned and emphasized. It is the items that teachers give
different weight to that have the greatest variance across classes. If some
teachers work hard to teach use of the semicolon and others consider it
unimportant, the semicolon items will correlate comparatively high over
classes. If teachers who care about semicolons may or may not care about
colons, a low across-classes correlation between semicolon and colon items
is to be expected. This leads us to regard the between-group and
within-group correlations of items as conveying different information, and
makes the overall correlation for classes pooled an uninterpretable blend.

Multiple regression and related techniques

A school-effects model. It makes sense to consider two or more measures
on the individual or the collective for many purposes. A simple school-effects
study might include $X_1$ = family background, $X_2$ = ability of teachers, and
$Y$ = student achievement. Suppose that data in just one community will be
examined. It appears best to identify the $X_2$ of the student's own teacher.
A conventional analysis might evaluate

(9.1)  $Y_p = \beta_{1\cdot 2} X_{1p} + \beta_{2\cdot 1} X_{2p}$

A more sophisticated individual-level analysis might add contextual variables
as a last step:

(9.2)  $Y_p = \beta_{1\cdot 2} X_{1p} + \beta_{2\cdot 1} X_{2p} + \beta_{3\cdot 12} X_{1s} + \beta_{4\cdot 123} X_{2s}$

Here, $X_{1s}$ and $X_{2s}$ are school-level aggregates.

The fact that the predictors, especially $X_{1s}$ and $X_{2s}$, tend to be
correlated creates difficulties of interpretation. The difficulties are
exacerbated by the fact that the number of schools is small. Consequently,
$\rho_{1s,2s}$ is likely to vary considerably from community to community. This
is a fact about local affairs, acceptable enough in considering fixed schools
in a fixed community. If communities are compared or an attempt is made to
generalize over communities, it is highly likely that the $\beta$ that one wishes
to interpret will be much affected by the value of $\rho$. Particularly if
$\rho_{1s,2s}$ is large, the $\beta_{3\cdot 12}$ and $\beta_{4\cdot 123}$ can be compared only at
considerable risk. It is not at all unlikely that their balance would shift
in another year in the same community even if $\rho_{1s,2s}$ does not change.

A decomposition seems to require aggregation at the class level (c) as
well as the school level (s). In place of (9.2) we have three equations:

(9.3)  $Y_s = \beta_{s1\cdot 2} X_{1s} + \beta_{s2\cdot 1} X_{2s}$

(9.4)  $Y_c - Y_s = \beta_{c1\cdot 2}(X_{1c} - X_{1s}) + \beta_{c2\cdot 1}(X_{2c} - X_{2s})$

(9.5)  $Y_p - Y_c = \beta_{p1\cdot 2}(X_{1p} - X_{1c}) + \beta_{p2\cdot 1}(X_{2p} - X_{2c})$

Assume that all definitions of parameters take number of students into account.
What is here written as $\beta_{s1\cdot 2}$ could be written $\beta_{1s\cdot 2s}$; in the notation
of (9.2) it would be $\beta_{3\cdot 4}$, with no partialling of 1 and 2. The term in
$\beta_{p2\cdot 1}$ is entered pro forma. Teacher quality $X_2$ cannot be disaggregated,
hence $X_{2p} = X_{2c}$ and the term vanishes. A global school quality would vanish
from the c and p equations (note 3). The sums of squares from (9.4) and (9.5)
can be combined into components of the within-schools SS.

The correlation $\rho_{1s,2s}$ is now relevant only to (9.3). The usual problem
of allocating variance between two correlated predictors (p. 2.17) arises at
the school and class-within-school levels, but $\rho_{1s,2s}$ applies to one and
$\rho_{1c,2c}$ to the other. In general, of course, interpretation of an equation
at the within-class level takes $\rho_{1p,2p}$ into account.

The important point to remember here is that $\beta_{s1\cdot 2}$, $\beta_{c1\cdot 2}$, and
$\beta_{p1\cdot 2}$ are coefficients for different predictors, predicting different
outcomes. The overall analysis of (9.1) evaluates only two out of five
parameters; even the analysis with added contextual variables leaves the
components entangled. From an explanatory point of view a single equation
fitted to the ecology provides less information than the set of equations.
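
A minimal sketch of the level-by-level decomposition, simplified to a single predictor and a balanced design. The pupil records are hypothetical and were constructed so that the school-level, class-within-school, and pupil-within-class slopes are 2, 1, and 0.5; the helper names are assumptions made for the example.

```python
from statistics import mean

def slope(x, y):
    """Least-squares slope of y on x (intercept included)."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Hypothetical pupils as (school, class, X, Y); classes are nested in schools.
pupils = [
    ("S1", "c1", 1, 5.5), ("S1", "c1", 3, 6.5),
    ("S1", "c2", 5, 9.5), ("S1", "c2", 7, 10.5),
    ("S2", "c3", 9, 23.5), ("S2", "c3", 11, 24.5),
    ("S2", "c4", 17, 31.5), ("S2", "c4", 19, 32.5),
]

def means_by(index):
    """Mean (X, Y) for each school (index 0) or each class (index 1)."""
    groups = {}
    for row in pupils:
        groups.setdefault(row[index], []).append(row)
    return {k: (mean([r[2] for r in rows]), mean([r[3] for r in rows]))
            for k, rows in groups.items()}

school = means_by(0)
klass = means_by(1)

# One value per pupil at every level, so each level is weighted by pupil counts.
xs = [school[s][0] for s, c, x, y in pupils]
ys = [school[s][1] for s, c, x, y in pupils]
xc = [klass[c][0] - school[s][0] for s, c, x, y in pupils]
yc = [klass[c][1] - school[s][1] for s, c, x, y in pupils]
xp = [x - klass[c][0] for s, c, x, y in pupils]
yp = [y - klass[c][1] for s, c, x, y in pupils]

b_school = slope(xs, ys)    # between schools
b_class = slope(xc, yc)     # classes within schools
b_pupil = slope(xp, yp)     # pupils within classes
b_overall = slope([r[2] for r in pupils], [r[3] for r in pupils])

print(b_school, b_class, b_pupil, round(b_overall, 2))
```

The single pooled slope (about 1.68 here) is a blend of the three level-specific slopes and equals none of them, which is the sense in which one equation carries less information than the set.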
Partialling. A multiple-regression equation gives regression weights
for one variable with others held constant. A partial correlation relates
two residualized variables. What usually goes unrecognized is that the
variable carrying the same label becomes an operationally different variable
at each level of aggregation. If we form $Y \cdot X$ at the aggregate level for
some group c, that value will rarely equal the average for c of the $Y \cdot X$
formed by partialling at the individual level.
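
The last point can be checked on a toy data set. The scores and group labels below are hypothetical; the sketch residualizes Y on X once over individuals and once over group means, then compares the two versions of "Y with X partialled out" for each group.

```python
from statistics import mean

def fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    b = (sum((u - mx) * (v - my) for u, v in zip(x, y))
         / sum((u - mx) ** 2 for u in x))
    return my - b * mx, b

# Hypothetical (X, Y) scores for individuals in three groups.
groups = {
    "G1": [(1, 3.5), (3, 4.5)],
    "G2": [(5, 11.5), (7, 12.5)],
    "G3": [(9, 19.5), (11, 20.5)],
}

# Individual level: residualize Y on X over all persons, then average by group.
all_x = [x for rows in groups.values() for x, _ in rows]
all_y = [y for rows in groups.values() for _, y in rows]
a_ind, b_ind = fit(all_x, all_y)
mean_individual_residual = {
    g: mean([y - (a_ind + b_ind * x) for x, y in rows])
    for g, rows in groups.items()
}

# Aggregate level: residualize the group means of Y on the group means of X.
mean_x = {g: mean([x for x, _ in rows]) for g, rows in groups.items()}
mean_y = {g: mean([y for _, y in rows]) for g, rows in groups.items()}
a_agg, b_agg = fit(list(mean_x.values()), list(mean_y.values()))
aggregate_residual = {
    g: mean_y[g] - (a_agg + b_agg * mean_x[g]) for g in groups
}

for g in groups:
    print(g, round(mean_individual_residual[g], 3), round(aggregate_residual[g], 3))
```

For group G1 the averaged individual-level residual is about -0.51, while the residual formed at the aggregate level is 0: the same label, two operationally different variables.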

Harnqvist (1975, p. 102) reports partial correlations (at the end of
secondary school) for achievement in literature with achievement in science
with reading comprehension held constant:

                       Iran    England

Individual level       0.25     -0.13
School level           0.36     -0.60

This is certainly of interest, as a kind of documentation for the "two cultures"
stereotype of British education. Where we must be cautious is in believing
that the same pair of variables has been correlated in each instance. The
variables may be denoted (with some notation that should be transparent) as
$L - b_1 Rd$ and $S - b_2 Rd$. But both $b_1$ and $b_2$ take on a new definition and
a new numerical value in each cell, as follows:

[Table of $b_1$ and $b_2$ values by cell; entries not legible in the source]

It seems to me highly dangerous to compare variables across levels and
across collectives when the operational definitions shift. To argue that
the several operations represent the same construct seems to entail an enormous
burden of proof; in one context reading may be a proxy for individual SES
and in the next context it may be more a proxy for global school quality.
This argument applies obviously to path analyses, since most of their
predictors are partial variates.

I can make only one recommendation and it is no more than a palliative. 
Recall that in the Anderson reanalysis (p. 5.8) Webb and I defined a 
variable ABIL = ABILITY - 0.47 PRECOM, where 0.47 was the overall regression
coefficient of ABILITY on PRECOM (within treatments pooled). We entered this
variable in the within-collectives and between-collectives analyses. The
correlation of ABIL with PRECOM in each of these analyses was close enough
to zero that we were able to reach interpretations; we did not have a
mind-boggling shift in definition. I speak of this procedure as a palliative
because one is not guaranteed that in each subset of the data the variable 
so formed will have a low correlation with the variable whose contribution 
was adjusted downward. 
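
A sketch of the palliative and of its limitation, on hypothetical scores (these are not the Anderson data, and the sketch uses a plain overall coefficient rather than one pooled within treatments). The adjusted variable is uncorrelated with PRECOM over all cases by construction, but nothing forces that to hold between or within collectives.

```python
from statistics import mean

def scp(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

def correlation(x, y):
    return scp(x, y) / (scp(x, x) * scp(y, y)) ** 0.5

# Hypothetical (PRECOM, ABILITY) scores in three collectives.
collectives = {
    "K1": [(1, 3), (2, 3), (3, 6)],
    "K2": [(4, 5), (5, 8), (6, 8)],
    "K3": [(7, 9), (8, 9), (9, 12)],
}
precom = [p for rows in collectives.values() for p, _ in rows]
ability = [a for rows in collectives.values() for _, a in rows]

# Overall regression coefficient of ABILITY on PRECOM (the role played by 0.47).
b_overall = scp(precom, ability) / scp(precom, precom)

def adjusted(p, a):
    """Adjusted ability: ABILITY minus b_overall times PRECOM."""
    return a - b_overall * p

# Overall: zero correlation with PRECOM by construction.
r_overall = correlation(precom, [adjusted(p, a) for p, a in zip(precom, ability)])

# Between collectives: correlate the collective means.
mean_p = [mean([p for p, _ in rows]) for rows in collectives.values()]
mean_adj = [mean([adjusted(p, a) for p, a in rows]) for rows in collectives.values()]
r_between = correlation(mean_p, mean_adj)

# Within collectives: correlate deviations from each collective's own mean.
dev_p, dev_adj = [], []
for rows, mp, madj in zip(collectives.values(), mean_p, mean_adj):
    dev_p += [p - mp for p, _ in rows]
    dev_adj += [adjusted(p, a) - madj for p, a in rows]
r_within = correlation(dev_p, dev_adj)

print(round(b_overall, 2), round(r_overall, 6), round(r_between, 2), round(r_within, 2))
```

In this contrived case the adjusted variable correlates -1 with PRECOM between collectives and about 0.46 within them, so the check on each subset of the data remains necessary.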



Table 9.3. Zero-order and multiple regression coefficients
for Head Start classes at two levels of aggregation,
with Metropolitan Readiness Test as dependent variable

                     POPED    POPINC    POPOCC    NKIDS

Between centers
  Zero-order          3.47     6.06      6.72      0.72
  Multiple            3.21     4.21      2.40      2.18

Within centers
  Zero-order          3.89     1.56      1.39     -0.34
  Multiple            3.61     0.50      0.58     -0.01

Based on Smith & Bissell, 1970
The multiple-regression equations given by Smith and Bissell, two of
which appear in Table 9.3, are further evidence on the same point. The
between-centers equation tells a quite different story, on its surface, than
the within-centers equation does. All variables contribute to the former
equation; within centers, only POPED has a large weight. (All variables had
been rescaled to similar metrics.) Some of the disparities are explainable
in terms of the zero-order coefficients, but the shift in NKIDS is not. The
difficulty lies in the fact that the weight is for "NKIDS with the other
three variables partialled out". Since the correlations among predictors
changed from one level to another, the partial variates are radically
different in their definitions from one analysis to the other.

Disparities such as these (including disparities across treatments) 
caused no difficulty for Smith and Bissell, since they were not attempting 
a causal interpretation of regression weights. Sociologists often do attempt
such interpretations, however. Some set of operationally defined 
quasiorthogonal composites would appear to be necessary for any comparison 
of regression coefficients across levels or across treatments. One might 
be wiser to examine zero-order regression coefficients for such composites 
than to use multiple regression, but path analysis calls for multiple regression. 
Notes for Section 9

1. p. 9.7  Duncan et al. (1969, p. 54) establish the following relationship:

   $r_{xy} = \eta_x \eta_y \, r_b + [(1 - \eta_x^2)(1 - \eta_y^2)]^{1/2} \, r_w$

   where $r_b$ and $r_w$ are the between-groups and within-groups correlations
   and $\eta^2$ is the correlation ratio.

2. p. 9.13  A note on procedures may be helpful to some readers. The
   SPSS programs "do not accept" covariance matrices, but since the
   analytic routines for correlation matrices apply to covariance matrices,
   we fooled the computer into thinking it was analyzing correlations.
   We entered the between-groups matrix with ones in the diagonal,
   used an option in the program to instruct the computer that the
   "estimated communalities" were the vector of $\eta^2$, and called for a
   principal-factor analysis with those communalities. The result was a
   principal-components solution for the between-groups covariance matrix.
   The same method was used for the within-groups matrix, with the vector
   of $1 - \eta^2$.

3. p. 9.21  I noted earlier that product terms can be added, particularly to
   enter global variables in lower-level equations. Thus (9.5) could become

   (9.5a)  $Y_p - Y_c = \beta_{p1}(X_{1p} - X_{1c}) + \beta_{p1c2}(X_{1p} - X_{1c}) X_{2c}$

   These two predictors are uncorrelated. A positive $\beta_{p1c2}$ implies that the
   abler teacher is associated with a steeper Y-on-X regression within the
   class. Even though analysis is based on individuals it is the number of
   teachers that regulates the sampling error of this coefficient.
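
The relationship quoted in the first note, which ties the overall correlation to the between-groups and within-groups correlations through the correlation ratios, can be verified numerically. The scores below are hypothetical and the variable names are assumptions made for this sketch; groups are of equal size for simplicity.

```python
from statistics import mean

def scp(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

# Hypothetical (x, y) scores in three groups.
groups = [
    [(1, 2), (2, 1), (3, 4)],
    [(4, 3), (5, 7), (6, 5)],
    [(7, 9), (8, 6), (9, 8)],
]
x = [a for g in groups for a, _ in g]
y = [b for g in groups for _, b in g]

# Between part: each score replaced by its group mean.
xb = [mean([a for a, _ in g]) for g in groups for _ in g]
yb = [mean([b for _, b in g]) for g in groups for _ in g]
# Within part: each score's deviation from its group mean.
xw = [a - m for a, m in zip(x, xb)]
yw = [b - m for b, m in zip(y, yb)]

r_total = scp(x, y) / (scp(x, x) * scp(y, y)) ** 0.5
r_between = scp(xb, yb) / (scp(xb, xb) * scp(yb, yb)) ** 0.5
r_within = scp(xw, yw) / (scp(xw, xw) * scp(yw, yw)) ** 0.5
eta2_x = scp(xb, xb) / scp(x, x)   # squared correlation ratio for x
eta2_y = scp(yb, yb) / scp(y, y)   # squared correlation ratio for y

rhs = ((eta2_x * eta2_y) ** 0.5 * r_between
       + ((1 - eta2_x) * (1 - eta2_y)) ** 0.5 * r_within)

print(round(r_total, 6), round(rhs, 6))
```

The two printed values agree, since the identity is only a standardized restatement of the split of the total cross-product into between-groups and within-groups parts.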
10. The Road Ahead

This report has been chiefly concerned with educational research in
which data are collected on students within classrooms or on classes within
districts. The difficulties noted in a great variety of commonplace studies
(Table 10.1) imply a need for new strategies of design and interpretation.
It is not clear to me just which of those difficulties will trouble
investigators in other fields, for example, students of voting
behavior or of reference-group theory; my impression is that their difficulties
include those treated here plus additional ones.

I started with a concern for a somewhat specialized kind of research, 
the study of Aptitude x Treatment interaction. That kind of inquiry has in 
the past examined overall regressions of outcome on aptitude for pools of 
students assembled from many classes. According to the analysis here, no 
meaningful question is being asked about interactions in group instruction 
unless between-group and within-group regressions are considered separately. 
The implications of my explorations extend far beyond the studies of inter- 
actions, however. 

Every educational or sociological study that attends directly to regression 
coefficients and correlations, including studies using structural-equation 
models, must be thought through in the light of the argument I have presented. 
Sometimes a traditional analysis looking only at overall relationships or 
between-group relationships will prove to be adequate for the purposes of the 
investigation. More often, I suspect, the investigator will find that a more
complex decomposition will add to his store of information. But it will also
make him painfully sensitive to the vagaries of results, no matter how
analyzed, when only a limited number of groups are sampled. And it begins
to appear that even the best of analyses will leave the causal interpretation
of results equivocal, unless assignment to groups was under the control of
the investigator.

Table 10.1. Kinds of investigation within education, psychology
and sociology where difficulties have been identified

                                          Page reference in this report

Analysis of covariance                    1.3a ff., 1.17a, 8.1 ff.
Aptitude x Treatment interaction          2.1 ff., 3.21, 5.1 ff.
Attenuation, correction for               1.7, 6.1 ff.
Classroom climate, assessing              9.18
Context effects                           1.16 ff., 1.22 ff., 3.16
Correlation, simple                       2.12, 9.1
Evaluating treatments                     1.3a ff., 1.13, 1.15, 1.18, 2.17,
                                          8.3, 8.8, 8.11 ff.
Factor analysis and component analysis    9.6 ff.
Item analysis in test construction        9.19
Multiple regression                       1.7, 9.20
Partial correlation                       9.22
Path analysis                             2.19, 3.17, 9.24
Placement rules, development of           2.8
Predicting scores within aggregates       1.12, 4.9
Regression, comparing across levels       1.17a, 3.16, et passim
Reliability studies                       6.1 ff.
School-effects studies                    1.3a, 2.17 ff., 9.20
Social-area analysis                      1.23

For the most part, I have made the following assumptions: 

1. Every member participates in just one collective at the next 
higher level. 

2. At successive levels of aggregation collectives are completely
nested.

3. Aggregates at the highest level are random and membership at 
lower levels is fixed. 

4. A member has no direct effect on scores of a collective to 
which he does not belong. 

5. Data are complete at the lowest level of disaggregation. 

6. There is a known causal order, X preceding Y.

The chief difficulties identified in educational studies that fit 
these assumptions are as follows: 

1. There is no warrant for direct generalization from groups 
formed in one way to groups assembled in some other manner. 

This requires revision of previous thinking about the classifica- 
tion of students on the basis of aptitude (Cronbach & Gleser, 1957; Cronbach & 
Snow, 1976). An investigator who applies a treatment to a number of individuals 
separately will identify a certain regression of outcome on aptitude, and may 
devise a selection rule so that only promising individuals will receive the 
treatment in the future. If there are two treatments, he may devise a classi- 
fication rule for deciding which treatment an individual is to receive. These 
rules are a suitable basis for future decisions if individuals from the same 
population are to be treated individually. If they are to be assembled into 
instructional groups, however, the overall regression of outcome on aptitude 
need not be the same as before; moreover, the possibility of distinct within-
and between-group relations should be taken into account in the decision rule.
The policy based on the individual-level investigation provides only a
tentative hypothesis about group-level instruction. The same argument applies
if the original study uses group instruction. If the initial groups are
assembled by a certain rule, the findings apply to future groups formed in 
the same manner. If future groups are formed in a different way, the old 
conclusions do not apply directly. In the short run, it appears necessary 
to insist on fresh validation research when persons are taught (or carry out 
tasks) in groups and the basis for forming groups is modified. In the long 
run, studies of groups formed in different ways might develop a theory that 
would permit reasonable predictions about kinds of groups that have not been 
directly investigated. 

2. Most experimental studies carried out in classrooms have been 
analyzed by means of "individual level" (overall) statistics, with classes 
ignored. The between-groups regression of outcome on aptitude is likely to 
differ from that within groups; the overall analysis combines the two kinds 
of relationship into a composite that is rarely of substantive interest. 

Individual-level analysis may be undertaken as a 
deliberate choice. Analysis of pooled individual data is warranted when: 

a) The investigator (as in much survey research) is interested
in a composite description of individuals who are mixed into groups in the
population as they are in the sample, and not in a causal interpretation of
group-related effects; or

b) The investigator is prepared to assume that any causal effects
associated with group membership are trivial in magnitude; or

c) Known conditions of group formation make the overall
statistic a comparatively efficient estimator of a between-groups parameter.
3. A conscious decision, rooted in the theoretical background or
the practical context of the investigation, is required to identify the
appropriate units for sampling, assignment to treatment, and analysis. This
is true in factor analysis, item analysis and empirical keying of tests,
prediction of scores for lower-level units, etc. An individual-level analysis,
a between-groups analysis, and a within-groups analysis address substantively
different questions and usually give different results.

4. When analysis of covariance is used to compare instructional
treatments, and data have been collected by using an intact collective
(e.g., the school) as the primary sampling unit and the unit of assignment,
the theoretically appropriate adjustment appears to be that given by the
regression coefficient across such collectives (perhaps within blocks).
In non-random experiments, analysis at some other level may give a
considerably different estimate of the adjusted treatment effect.

To determine a between-classes or a between-schools regression 
coefficient with reasonable precision one must have a much larger sample of 
such units than is normally available to the experimenter. Consequently, 
his adjustment may be heavily influenced by sampling errors. 

5. In the study of aptitude-treatment interactions, particular
interest attaches to the between-classes regression coefficient, because
instructional treatments will most often be assigned to whole classes. As
has been said, it is rarely practical to determine such a regression
coefficient with precision.
This fact, together with the hazard in generalizing to groups formed by
assembly rules, seems to discourage altogether a sheer empirical search for
ATI in classroom instruction.

6. For the within-groups coefficient, the sampling error depends
on the number of classes and the variability of the specific within-groups 
coefficients. The customary manner of examining such sampling errors ignores 
that variability. Sometimes the estimate of the pooled within-groups 
coefficient will be undependable even though hundreds of individuals provide data. 

7. No secure causal interpretation can be given to the between-groups 
regression coefficient, the within-groups coefficient, or their difference. 

The simple interpretations regarding "context effects" and "school effects" 
that have been made in the past are not defensible. 

At least three causal processes may affect regression slopes: direct 
effects of the individual's characteristics on his performance, competitive 

effects or other differentiation of experience arising from the heterogeneity 
of individuals within groups, and processes that raise or lower the outcomes 
for groups with high (low) standing on the predictor variable. The first
and third of these are confounded in the between-groups coefficient, and the
first two affect the within-groups coefficient. There is no direct way, then, to
evaluate the magnitudes of the three effects from the two observed coefficients 
(especially as the two components affecting a regression coefficient may work 
in opposite directions). 

8. Even when only strictly individual causes operate, the between- 
groups and within-groups regression coefficients will generally differ from 
that calculated for individuals pooled, because of demographic differences 
among the groups. These differences may reflect social processes, or
aggregation rules of the data collector. This creates great ambiguities in
interpreting differences between parameters at two levels of aggregation, 

or differences between parameters for two treatment populations that were 
aggregated into groups by different rules. 
9. Studies of regression coefficients, and analyses of covariance
in which regression coefficients play a part, cannot be soundly interpreted
without due consideration of errors of observation of the predictor variable.
Direct comparison of an observed between-groups regression coefficient with
an observed within-groups coefficient is surely unsound. Observed regression
coefficients will not be patterned like the coefficients for universe scores;
yet the latter are of more fundamental interest. Contrary to the usual
belief, group-level information will not necessarily have a smaller standard
error and a larger coefficient of generalizability than information on
individuals within groups.

The information that would be required to evaluate the
reliability (generalizability) of group-level data has not been collected
in studies to date. Indeed, the theory for such studies has barely begun
to evolve. Yet, let me repeat, without proper disattenuation one
arrives at incorrect answers to the kinds of questions educational research 
workers and sociologists have been trying to study. 
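
The disattenuation point can be made concrete under the classical error model. The block below is an illustration under assumptions that are not stated in the text: the observed predictor is $X = T + E$, the error $E$ is uncorrelated with the universe score $T$ and with the outcome $Y$, and the symbols $\rho_{XX'}$, $\sigma_T^2$, $\sigma_E^2$ and the superscripts (b) and (w) for the between-groups and within-groups levels are notation introduced here.

```latex
% Observed slope = reliability of the predictor times the universe-score slope
\beta_{Y \cdot X} = \rho_{XX'} \, \beta_{Y \cdot T},
\qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}

% Applied separately at the between-groups (b) and within-groups (w) levels
\frac{\beta^{(b)}_{Y \cdot X}}{\beta^{(w)}_{Y \cdot X}}
  = \frac{\rho^{(b)}_{XX'}}{\rho^{(w)}_{XX'}}
    \cdot
    \frac{\beta^{(b)}_{Y \cdot T}}{\beta^{(w)}_{Y \cdot T}}
```

Unless the two reliabilities happen to be equal, the ratio of observed coefficients misstates the ratio for universe scores, which is why an observed between-groups coefficient cannot be compared directly with an observed within-groups coefficient.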

Sweeping recommendations cannot be made because of our present ignorance 
and because tactics must be suited to each substantive problem. The following 
suggestions are derived in part from what others have written.

1. There should be a deliberate attempt to conceptualize the processes
operating when persons are treated in groups or hierarchical structures formed
on a certain basis. Once a process is postulated one can hope to suggest
indicators of intermediate events that will give information on the strength
of the process. Such more complete specification of the model will be
required to get data that warrant causal interpretation.
2. The process by which individuals are assigned to (or voluntarily enter)
groups should be specified as fully as possible. In general, when group
membership is under full control of an investigator (whether he uses a
random process or groups persons having specified characteristics),
interpretation of findings will be freed of many of the uncertainties that
arise with self-selected groups.

Where the collectives already exist, there should be a careful description
of the way collectives differ. These discriminating variables are a part of
the causal process by which group effects are generated, and they may create
noncausal relationships at the group level.

3. Some amount of direct experimentation on context effects should be
carried out, to supplement and lend supporting insight to the correlational 
studies under field conditions. It should be instructive, for example, to 
investigate how learning differs, cognitively and affectively, when persons 
work alone and when similar persons work in a group composed in one or 
another manner. To experimentally disentangle effects of group composition, 
of group differences in the treatment delivered, of within-group differences 
in experiences, and of individual differences properly speaking would inform 
both methodologists and theorists. Such studies will necessarily be limited 
in time and size, and cannot answer questions about cumulative effects of 
group experience. 

4. In an experiment or survey, it would often be advisable at the first
stage of sampling to select units at the highest level at which causal
processes of interest operate. In an experiment, it is those units that
should be assigned to treatment. Thus, in a study of school desegregation, it
seems reasonable to take the community as the sampling unit. This is true not
only because desegregation plans are typically district-wide but also because
the attitudes of patrons in one school are likely to influence their fellow
townspeople. An exception
to this largest-unit principle is noted for static descriptive studies such 
as public-opinion polls; if the interest is only in the population average 
or some similar statistic, conventional sampling of small units is efficient. 
Another qualification: In contrasting treatment effects at one level it may 
be efficient to select random collectives at the next higher level, and then to 
divide the members of each one among treatments. 

Once the large unit is chosen, it may be sensible to sample smaller units 
within it: schools within a district, classrooms within a school, or students 
within a classroom. Such multistage sampling has to be designed to fit the 
purposes of the particular investigation (Jaeger, 1970). For the kinds of 
studies discussed in the report, it will almost always be more important to 
increase the number of schools or classrooms. It may be appropriate to 
test only a fraction of the individual students, if that economy permits an 
increase in the number of groups. 

In obtaining a between-collectives regression coefficient, there are 
advantages in an extreme-groups design, most or all of the data being taken 
from groups whose means are far out on the predictor scale. This design has 
superior cost-effectiveness for evaluating a univariate regression equation. 
I do not suggest, however, that groups be formed by assembling individuals 
with extreme scores; such groups are not appropriate representatives of the 
reference population of collectives formed in a more normal manner. 
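
The cost-effectiveness of an extreme-groups design follows from the usual formula for the standard error of a simple-regression slope, which shrinks as the spread of the predictor grows. The sketch below assumes a linear relation and a common residual standard deviation; the 40 class means and the two selection rules are hypothetical.

```python
import math


def slope_standard_error(x_values, residual_sd):
    """Standard error of a simple-regression slope:
    residual_sd / sqrt(sum of squared deviations of x)."""
    mean_x = sum(x_values) / len(x_values)
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    return residual_sd / math.sqrt(sxx)


# Hypothetical predictor means for 40 intact classes (values 40..79).
available_class_means = [40 + i for i in range(40)]

# Two ways to pick 10 existing classes for intensive data collection.
evenly_spread = available_class_means[::4]                              # every fourth class
extreme_groups = available_class_means[:5] + available_class_means[-5:]  # 5 lowest, 5 highest

se_spread = slope_standard_error(evenly_spread, residual_sd=5.0)
se_extreme = slope_standard_error(extreme_groups, residual_sd=5.0)

print(f"evenly spread classes: SE = {se_spread:.4f}")
print(f"extreme classes:       SE = {se_extreme:.4f}")
```

The classes here are selected as intact units according to their means; nothing in the sketch assembles new groups out of extreme individuals. A design of this kind gives no information about curvature, so it presupposes that a linear equation is adequate.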

5. In most studies, it will be impractical to collect extensive data on a
large sample of high-level units. Not many investigators will be in a
position to investigate 100 school systems, or even 100 classes. Research
on higher-level units, then, will have to be more in the character of case
studies, and less in the character of statistical studies. This point was
made by Merton and Kitt (1950) in one of the first modern papers on group 
effects, but it seems to have dropped from sight during subsequent attempts 
to draw conclusions about context effects from survey-like data. 

6. The mode of analysis of effects is to be determined by the substantive 
model suggested for the processes at the several levels. It will often be 
appropriate to form separate structural models for between-groups relations 
and for within-groups relations, taking into account the rules by which 
collectives are formed. 

7. Plots of the data should be made, to the greatest degree possible. 
Repeatedly, bivariate plots of group means have improved my interpretation of 
between-group statistics (often by inducing caution). Studies of groups
will usually be limited in size, and outliers can make a large difference
in the statistics.
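
The sensitivity of between-group statistics to a single outlying group can be checked numerically alongside the plot. The following is a minimal sketch using leave-one-out correlations; the eight class means are invented for illustration and the helper names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def leave_one_out_r(xs, ys):
    """Correlation recomputed with each group omitted in turn."""
    return [
        pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


# Hypothetical means for 8 classes; the last class lies far from the rest.
aptitude_means = [48, 50, 51, 52, 53, 54, 55, 70]
outcome_means = [52, 49, 53, 50, 54, 51, 52, 75]

r_all_groups = pearson_r(aptitude_means, outcome_means)
r_without_each = leave_one_out_r(aptitude_means, outcome_means)

print(f"all 8 classes: r = {r_all_groups:.2f}")
for i, r in enumerate(r_without_each):
    print(f"omitting class {i + 1}: r = {r:.2f}")
```

With these invented values the correlation across all eight classes is high, yet it falls sharply once the eighth class is omitted, which is the kind of caution a bivariate plot induces at a glance.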

8. In predicting scores of members of collectives, it will generally be
sound to estimate the dependent variable for the collective and then add the 
predicted deviation of the member's score from that group value, instead of 
making a one-stage prediction. 

Even if the two-stage prediction accounts for rather little additional 
variance, it may make statements about atypical individuals that differ 
appreciably from the simple prediction. 
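
The two-stage prediction can be sketched as follows. This is one possible reading under simplifying assumptions (equal group sizes, a single predictor, a pooled within-groups slope), not a procedure specified in this report; the nine records and all function names are hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for y regressed on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical (group, predictor, outcome) records: three classes of three.
records = [
    ("A", 1, 5), ("A", 2, 4), ("A", 3, 3),
    ("B", 4, 8), ("B", 5, 7), ("B", 6, 6),
    ("C", 7, 11), ("C", 8, 10), ("C", 9, 9),
]

by_group = defaultdict(list)
for group, x, y in records:
    by_group[group].append((x, y))

group_mean_x = {g: sum(x for x, _ in pts) / len(pts) for g, pts in by_group.items()}
group_mean_y = {g: sum(y for _, y in pts) / len(pts) for g, pts in by_group.items()}

# Stage 1: between-groups regression of group mean outcome on group mean predictor.
groups = sorted(by_group)
between_intercept, between_slope = fit_line(
    [group_mean_x[g] for g in groups], [group_mean_y[g] for g in groups]
)

# Stage 2: pooled within-groups regression on deviations from group means.
dev_x = [x - group_mean_x[g] for g, x, _ in records]
dev_y = [y - group_mean_y[g] for g, _, y in records]
_, within_slope = fit_line(dev_x, dev_y)

# One-stage comparison: a single regression that ignores group membership.
total_intercept, total_slope = fit_line(
    [x for _, x, _ in records], [y for _, _, y in records]
)


def two_stage_prediction(group, x):
    """Group-level estimate plus the predicted deviation of the member."""
    group_value = between_intercept + between_slope * group_mean_x[group]
    return group_value + within_slope * (x - group_mean_x[group])


def one_stage_prediction(x):
    """Prediction from the individual-level regression alone."""
    return total_intercept + total_slope * x


print(two_stage_prediction("C", 9), one_stage_prediction(9))
```

In this contrived data set the between-groups slope is positive while the within-groups slope is negative. For the member of class C with predictor score 9, whose observed outcome is 9, the two-stage prediction is 9.0 and the one-stage prediction is 10.2; the discrepancy is largest for members far from their own group mean.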

I have adopted the working hypothesis that treating persons in groups
does make a difference. The makeup of a group may determine the events that 
impinge on the group as a whole, and may condition the events that impinge 
on the individual group member or on his perception of them. Some investi- 
gators will prefer a simpler working hypothesis that tries to explain the 
world without reference to group effects. No doubt group effects are 
negligible in some instruction and in some social processes, but even the 
investigator who prefers to deny their existence will be wise to allow his 
data to speak on the point, to the extent that a design of modest size can 
give information. It is intellectually legitimate to adopt a strong model 
that assumes absence of group effects, but this is likely to be a poor 
strategy in any substantive field where one has insufficient experience to 
make the assumption persuasive. 

The issues that have come to light in this paper lead me to think that 
educational studies conducted in classrooms or with data from schools and 
school districts have almost never been analyzed correctly. Those few 
investigators who have taken the collective as the unit of analysis have
rarely brought to the surface the potentially interesting within-group 
information. Moreover, survival of the null hypothesis with groups as the 
unit of analysis must often have been given a substantive interpretation 
without realization that the sample was insufficient to make the study 
informative. Per contra, use of the individual as unit of analysis when data 
are collective is likely to reject null hypotheses falsely. Descriptive as 
well as inferential statistics obtained by analyzing data on collectives at 
the individual level are open to misinterpretation, except where the
interpreter realizes that he is looking at a composite figure for an 
assemblage of groups. 
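
The tendency of individual-level analysis to reject null hypotheses falsely when treatments are applied to intact groups can be demonstrated by simulation. The sketch below is illustrative only: it assumes no treatment effect, 10 classes of 25 per treatment, an intraclass correlation of 0.2, and a fixed random seed. The critical values 1.96 (large-sample approximation) and 2.101 (t with 18 degrees of freedom) are the usual two-sided 5 percent points.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * sample_variance(a) + (nb - 1) * sample_variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def false_rejection_rates(reps=400, classes_per_arm=10, class_size=25, icc=0.2, seed=1):
    """Simulate a true null and return (pupil-level rate, class-level rate)."""
    rng = random.Random(seed)
    class_sd, pupil_sd = math.sqrt(icc), math.sqrt(1 - icc)
    reject_pupil = reject_class = 0
    for _ in range(reps):
        arms = []
        for _arm in range(2):
            classes = []
            for _c in range(classes_per_arm):
                class_effect = rng.gauss(0, class_sd)  # shared by the whole class
                classes.append(
                    [class_effect + rng.gauss(0, pupil_sd) for _ in range(class_size)]
                )
            arms.append(classes)
        pupils = [[score for cls in arm for score in cls] for arm in arms]
        class_means = [[mean(cls) for cls in arm] for arm in arms]
        if abs(t_statistic(pupils[0], pupils[1])) > 1.96:
            reject_pupil += 1
        if abs(t_statistic(class_means[0], class_means[1])) > 2.101:
            reject_class += 1
    return reject_pupil / reps, reject_class / reps


pupil_rate, class_rate = false_rejection_rates()
print(f"pupil as unit: {pupil_rate:.2f}   class as unit: {class_rate:.2f}")
```

With the class as unit the false-rejection rate stays near the nominal 5 percent, while with the pupil as unit it is several times larger under these assumptions.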

Oddly, the history of the aggregation problem in sociology, politics, 
and psychology has been one of regarding individual-level relationships as 
the information of primary interest, and group-level relationships as a 
distorted shadow of the former. Once the conventional individual-level 
information is seen as a composite of within-groups and between-groups effects 
(at least for certain variables and certain collectives), the situation is 
nearly reversed. The conventional mixture of the two effects is not usually 
the most meaningful variable to enter into hypotheses. 
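
The sense in which the individual-level relation is a composite can be made concrete for a regression slope. By a standard identity, the slope computed over all individuals equals eta_sq * b_between + (1 - eta_sq) * b_within, where eta_sq is the between-groups share of the predictor's sum of squares, b_between is the size-weighted regression on group means, and b_within is the pooled within-groups slope. The sketch below checks that identity numerically on invented data; the school labels and scores are hypothetical.

```python
def decompose_slope(groups):
    """Return (b_total, b_between, b_within, eta_sq) for a dict that maps
    each group name to a list of (x, y) pairs."""
    points = [p for pts in groups.values() for p in pts]
    n = len(points)
    grand_x = sum(x for x, _ in points) / n
    grand_y = sum(y for _, y in points) / n

    sxx_between = sxy_between = sxx_within = sxy_within = 0.0
    for pts in groups.values():
        m = len(pts)
        mean_x = sum(x for x, _ in pts) / m
        mean_y = sum(y for _, y in pts) / m
        sxx_between += m * (mean_x - grand_x) ** 2
        sxy_between += m * (mean_x - grand_x) * (mean_y - grand_y)
        sxx_within += sum((x - mean_x) ** 2 for x, _ in pts)
        sxy_within += sum((x - mean_x) * (y - mean_y) for x, y in pts)

    # Total slope computed directly from individuals, ignoring groups.
    sxx_total = sum((x - grand_x) ** 2 for x, _ in points)
    sxy_total = sum((x - grand_x) * (y - grand_y) for x, y in points)

    b_total = sxy_total / sxx_total
    b_between = sxy_between / sxx_between
    b_within = sxy_within / sxx_within
    eta_sq = sxx_between / sxx_total
    return b_total, b_between, b_within, eta_sq


# Hypothetical data: three schools of unequal size.
schools = {
    "school_1": [(1, 5), (2, 4), (3, 3)],
    "school_2": [(4, 8), (6, 6)],
    "school_3": [(7, 11), (8, 10), (9, 9), (10, 8)],
}

b_total, b_between, b_within, eta_sq = decompose_slope(schools)
composite = eta_sq * b_between + (1 - eta_sq) * b_within
print(b_total, b_between, b_within, eta_sq, composite)
```

Here the between-groups slope is positive and the within-groups slope is negative, so the individual-level slope describes neither process; it is a mixture whose weights depend only on how the predictor's variance happens to be divided among and within schools.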

Not all studies in collectives will move in the direction emphasized in 
this report. There probably are kinds of investigation (factor analysis of 
reading-readiness tests being one) where the question can best be posed at 
the individual level even though data come from collectives. Even in such 
studies, however, the investigator would do well to ponder the proposition
that if he randomly samples one student from each school in a large area he 
will get a different result than if he includes all the students in all the 
schools, or confines his analysis to students in a single school. 
Methodology should be matched to the substantive context; for
example, factor analysis of the Learning Environment Inventory would seem
particularly to call for distinguishing relations between and within groups. 

The methodological maxim appears to be that an investigator who collects 
data on collectives ought to take an explicit position on the role of 
between-group, within-group, and individual-without-regard-to-group effects
in the variables he studies. He may opt deliberately for any one of several
analyses, but he should not back into one of the analyses merely because it
is commonplace or convenient. Perhaps an investigator will wish to leave
the question open, and analyze the data at several levels. This is to be 
encouraged so long as each analysis is logical, and the interpretation is 
ecological. To regard analyses at two levels as alternative ways to answer 
the same question is rarely if ever justified. 

I have suggested elsewhere (Cronbach, 1975) that the ideal of establish- 
ing lawlike, lasting relationships in the social sciences may be unapproach- 
able. In that paper I was focussing on the study of the psychology of 
individuals. This report makes the difficulties seem even more forbidding. 
A social science must deal with collectives, and the cost of obtaining 
data on collectives is great. It appears that the only recourse is to 
make more use of the data we can afford to collect, appreciating hints
in the data with due regard for their uncertainty, and enriching our 
quantitative summaries with awareness of the qualitative context of the
events.